Preface



We greeted the attendees of CIVR 2004 with the following address, delivered in
Irish: “We are very proud to welcome you to Dublin City University and to the
third International Conference on Image and Video Retrieval. We hope that you
will have a wonderful time here in Ireland and that your visit will be enjoyable
and satisfying. We are especially proud to welcome the people from so many
different countries and the people who have come from far away. So many papers
were submitted to this conference that the standard of the papers and posters
is very high indeed, and we are greatly looking forward to a wonderful
occasion.”

We were delighted to host the 3rd International Conference on Image and
Video Retrieval in Dublin City University. We hope that all attendees had a 
wonderful stay in Ireland and that their visits were enjoyable and rewarding. 

There were 125 papers in total submitted to the CIVR 2004 conference and 
each was reviewed by at least three independent reviewers. We are grateful to 
the 64 members of the technical programme committee and the 29 other revie- 
wers who completed these reviews and allowed us to put together a very strong 
technical programme. The programme included 4 invited keynote presentations 
from Nick Belkin, Shih-Fu Chang, Andrew Fitzgibbon and Joe Marks and we 
are very grateful to them for their incisive and thoughtful presentations. The 
programme also contained 29 paper presentations and 44 poster presentations 
as well as two other guest presentations. The programme committee chairs did 
an excellent job in putting together the schedule for this conference, and the 
local arrangements, finance and publicity chairs also put a lot of work into ma- 
king this conference happen. Special thanks should go to Cathal Gurrin and 
Hyowon Lee for the tremendous amount of hard work they put into supporting 
CIVR 2004. 

The CIVR 2004 conference was held in cooperation with the ACM 
SIGIR, the Irish Pattern Recognition and Classification Society, the EU FP5 
SCHEMA Network of Excellence, and the BCS IRSG, and we are grateful to 
these organizations for promoting the event. We are also indebted to Science 
Foundation Ireland whose financial support allowed many students and other 
delegates to attend. 
Pattern Mining in Large-Scale Image and Video Sources



Shih-Fu Chang 

Digital Video and Multimedia Lab, Department of Electrical Engineering, 
Columbia University, New York, NY 10027, USA 
sfchang@ee.columbia.edu
http://www.ee.columbia.edu/dvmm
http://www.ee.columbia.edu/~sfchang

Abstract. Detection and recognition of semantic events has been a major re- 
search challenge for multimedia indexing. An emerging direction in this field 
has been unsupervised discovery (mining) of patterns in spatial-temporal mul- 
timedia data. Patterns are recurrent, predictable occurrences of one or more en- 
tities that satisfy statistical, associative, or relational conditions. Patterns at the 
feature level may signify the occurrence of primitive events (e.g., recurrent 
passing of pedestrians). At the higher level, patterns may represent cross-event 
relations; e.g., recurrent news stories across multiple broadcast channels or re- 
petitive play-break alternations in sports. Patterns in an annotated image col- 
lection may indicate collocations of related semantic concepts and perceptual 
clusters. 

Mining of patterns of different types at different levels offers rich benefits,
including automatic discovery of salient events or topics in a new domain,
automatic generation of alerts indicating unusual situations, and summarization
of concept structures in a massive collection of content.

Many challenging issues emerge. What are the adequate representations and
statistical models for patterns that may exist at different levels and different
time scales? How do we effectively detect and fuse patterns supported by
different media modalities, and how do we handle patterns that occur relatively
sparsely? How do we evaluate the quality of mining results given the
unsupervised nature of the process?

In this talk, I will present results of our recent efforts in mining patterns in 
structured video sequences (such as sports and multi-channel broadcast news) 
and large collections of stock photos. Specifically, we will discuss the potential
of statistical models like Hierarchical HMM for temporal structure mining, 
probabilistic latent semantic analysis for discovering hidden concepts, a hierar- 
chical mixture model for fusing multi-modal patterns, and the combined explo- 
ration of electronic knowledge (such as WordNet) and statistical clustering for 
image knowledge mining. 

Evaluations against real-world videos such as broadcast sports, multi-channel 
news, and stock photos will be presented. Future directions and open issues 
will be discussed. 
Computer Vision in the Movies: From the Lab to the Big Screen



Andrew Fitzgibbon 

Visual Geometry Group, University of Oxford, U.K. 
awf@robots.ox.ac.uk
http://www.robots.ox.ac.uk/~awf



Abstract. I will talk about recent and current work at Oxford on the automatic 
reconstruction of 3D information from 2D image sequences, and the applica- 
tions of this technology to robotic navigation, augmented reality and the movie 
industry. The results of our work have been used on such movies as the “Lord
of the Rings” and “Harry Potter” series, and in 2002 we received an Emmy
award for our contributions to television. 

The “take-home” of this talk is in the story of how to move from leading-
edge research ideas to reliable commercial-quality code which must perform 
under the rigid time constraints of the movie industry. 

After an introduction to the basic tools of 3D reconstruction, I will demon- 
strate how the development of robust procedures for statistical estimation of 
geometric constraints from image sequences has led to the production of a 
highly reliable system based on leading-edge machine vision research. I will 
talk about the special character of film as an industry sector, and where that 
made our work easier or harder. 

I shall conclude by talking about ongoing work in our lab, including vision 
and graphics as well as some visual information retrieval. 
Image and Video Retrieval Using New Capture and
Display Devices 



Joe Marks 

Mitsubishi Electric Research Laboratories (MERL), Cambridge, Massachusetts, USA 
http://www.merl.com

Abstract. Given a standard camera and a standard display screen, image- and 
video-retrieval problems are well understood, if not yet solved. But what hap- 
pens if the capture device is more capable? Or the display device? Some of the 
hard problems might well be made easier; and some of the impossible problems 
might become feasible. 

In this talk I will survey several novel input and output devices being developed at 
MERL that have the ability to change the nature of image and video retrieval, espe- 
cially for industrial applications. These projects include: 

• The Nonphotorealistic Camera: By illuminating a scene with multiple flashes, 
discontinuity edges can be distinguished from texture edges. The identification 
of discontinuity edges allows for stylistic image renderings that suppress detail 
and enhance clarity. 

• The Fusion Camera: Images are captured using two cameras. A regular
videocamera captures visible light; another videocamera captures far-infrared
radiation. The two image streams can be combined in novel ways to provide
enhanced imagery.

• Instrumenting Environments for Better Video Indexing: Doors and furniture in 
work environments are instrumented with cheap ultrasonic transducers. The 
ultrasonic signals of these transducers are captured by microphone, down- 
shifted in frequency, and recorded on the normal audio track alongside the 
captured video. The recorded audio signals are used to help index or search 
through the video. 

• Visualizing Audio on Video: A typical security guard can monitor up to 64 
separate video streams simultaneously. However, it is impossible to monitor 
more than one audio stream at a time. Using a microphone array and sound- 
classification software, audio events can be identified and located. Visual rep- 
resentations of the audio events can then be overlaid on the normal video 
stream, thus giving the viewer some indication of the sounds in a video- 
monitored environment. 

• A Cheap Location-Aware Camera: A videocamera is instrumented with a sim- 
ple laser-projection system. Computer-vision techniques are used to determine 
the relative motion of the camera from the projected laser light. By relating the 
motion to a known fixed point, absolute camera location can be obtained.
The captured video can thus be indexed by camera location in a cost-
effective manner.

• An End-to-End 3D Capture-and-Display Video System: Using a camera array, 
a projector array, and a lenticular screen, video can be captured and displayed 
in 3D without requiring viewers to wear special viewing glasses. 

• Image Browsing on a Multiuser Tabletop Display: Many image-retrieval and 
image-analysis tasks may be performed better by teams of people than by in- 
dividuals working alone. Novel hardware and software allow multiple users 
to view and manipulate imagery together on a tabletop display. 

I will also speculate on how some future changes to capture and display devices 
may further impact image and video retrieval in the not-too-distant future. 

The projects above represent the research efforts of many members of the MERL
staff. For appropriate credits, please visit the MERL web site, www.merl.com.

Speaker bio: 

Joe Marks grew up in Dublin, Ireland, before emigrating to the U.S. in 1979. He 
holds three degrees from Harvard University. His areas of interest include computer 
graphics, human-computer interaction, and artificial intelligence. He has worked 
previously at Bolt Beranek and Newman and at Digital's Cambridge Research Labo- 
ratory. He is currently the Director of MERL Research. He is also the recent past 
chair of ACM SIGART and the papers chair for SIGGRAPH 2004. 
Abstract. A crucial issue for research in video information retrieval (VIR) is
the relationship between the tasks which VIR is supposed to support, and the 
techniques of representation, matching, display and navigation within the VIR 
system which are most appropriate for responding to those tasks. Within a gen- 
eral model of information retrieval as support for user interaction with infor- 
mation objects, this paper discusses how different tasks might imply the use of 
different techniques, and in particular, different modes of interaction, for
“optimal” VIR within the different task environments. This analysis suggests that
there will be no universally applicable VIR techniques, and that really effective 
VIR systems will necessarily be tailored to specific task environments. This in 
turn suggests that an important research agenda in VIR will be detailed task 
analyses, with concomitant specification of functionalities required to support 
people in accomplishment of their tasks. 
Abstract. In this paper we present the results of a user study that
was conducted in combination with a submission to TRECVID 2003. 

Search behavior of students querying an interactive video-retrieval system
was analyzed. A total of 242 searches by 39 students on 24 topics were assessed.
Questionnaire data, logged user actions on the system, and a quality mea- 
sure of each search provided by TRECVID were studied. Analysis of the 
results at various stages in the retrieval process suggests that retrieval 
based on transcriptions of the speech in video data adds more to the 
average precision of the result than content-based retrieval. The latter is 
particularly useful in providing the user with an overview of the dataset 
and thus an indication of the success of a search. 

1 Introduction 

In this paper we present the results of a study in which search behavior of 
students querying an interactive video-retrieval system was analyzed. Recently, 
many techniques have been developed to automatically index and retrieve
multimedia. The Video Retrieval Track at TREC (TRECVID) provides test
collections and software to evaluate these techniques. Video data and statements
of information need (topics) are provided in order to evaluate video-retrieval systems
performing various tasks. In this way, the quality of the systems is measured. 
However, these measures give no indication of user performance. User variables 
like prior search experience, search strategies, and knowledge about the topic 
can be expected to influence the search results. Due to the recent nature of au- 
tomatic retrieval systems, not many data are available about user experiences. 
We argue that knowledge about user behavior is one way to improve perfor- 
mance of retrieval systems. Interactive search in particular can benefit from this 
knowledge, since the user plays such a central role in the process. 

We study information seeking behavior of users querying an interactive video- 
retrieval system. The study was conducted in combination with a submission to 
TRECVID 2003 [1]. Data were recorded about user characteristics, user estima- 
tions of the quality of their search results, familiarity of users with the topics, 
and actions performed while searching. The aim of the study was to investi- 
gate the influence of the recorded user variables on the average precision of the 
search results. In addition, a categorization was made of the 24 topics that were 
provided by TRECVID. The categories show differences in user behavior and 
average precision of the search results. 
2 Research Questions

To gain knowledge about how user-related factors affect search in a state-of-the- 
art video-retrieval system, we record actions that users take when using such a 
system. In particular, we are interested in which actions lead to the best results. 
To achieve an optimal search result, it is important that a user knows when to 
stop searching. In this study we therefore measure how well users estimate the 
precision and recall of their search. 

It is possible that different topics or categories of topics lead to different user 
strategies and differences in the quality of the results. We compare the search 
behavior and search results of categories of topics. In sum, the main questions 
in the study are: 

1. What search actions are performed by users and which actions lead to the 
best search results? 

2. Are users able to estimate the success of their search? 

3. What is the influence of topic type on user actions and search results? 

3 The ISIS Video Retrieval System 

The video-retrieval system on which the study was performed was built by the 
Intelligent Sensory Information Systems (ISIS) group at the University of Am- 
sterdam for the interactive video task at TRECVID. For a detailed description 
of the system we refer to [1]. 

The search process consists of four steps: indexing, filtering, browsing and 
ranking. Indexing is performed once off-line. The other three steps are performed 
iteratively during the search task. The aim of the indexing step is to provide 
users with a set of high-level entry points into the dataset. We use a set of 
17 specific concept detectors developed by CMU for TRECVID, such as
female speech, aircraft and newsSubjectMonologue. We augment the high-level 
concepts by deriving textual concepts from the speech recognition result using 
Latent Semantic Indexing (LSI). Thus we decompose the information space into 
a small set of broad concepts, where the selection of one word from the concept 
reveals the complete set of associated words also. 

For all keyframes in the dataset low-level indexing is performed by computing 
the global Lab color histograms. To structure these low-level visual descriptions 
of the dataset, the whole dataset is clustered using k-means clustering with 
random initialization. The k in the algorithm is set to 143 as this is the number 
of images the display will show to the user. In summary, the off-line indexing 
stage results in three types of metadata associated with each keyframe: (1) the 
presence or absence of 17 high-level concepts, (2) words occurring in the shot 
extended with associated words and (3) a color histogram. 
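
The clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ISIS implementation: the helper names `squared_distance` and `kmeans`, the toy three-bin histograms, the fixed iteration count and the seed are all hypothetical. The system itself uses k = 143 on Lab color histograms of real keyframes.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means with random initialization (hypothetical helper)."""
    rng = random.Random(seed)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each histogram to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignment

# Toy 3-bin "histograms" (made-up values, not data from the study).
histograms = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.0],   # mass mostly in the first bin
    [0.0, 0.1, 0.9], [0.1, 0.0, 0.9],   # mass mostly in the last bin
]
centroids, assignment = kmeans(histograms, k=2)
```

In a visualization like the one described for the browsing step, one keyframe per cluster (for instance the member closest to each centroid) would then stand in for the whole cluster.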

After indexing, the interactive process starts. Users first filter the total corpus 
of video by using the indexing data. Two options are available for filtering: 
selecting a high-level concept, and entering a textual query that is used as a 
concept. These can be combined in an ’and’ search, or added in an ’or’ search. 
The filtering stage leads to an active set of shots represented as keyframes,
which are used in the next step, browsing. At this point in the process it is 
assumed that the user is going to select relevant keyframes from within the 
active set. To get an overview of the data the user can decide to look at the 
clustered data, rather than the whole dataset. In this visualization mode, the 
central keyframe of each cluster is presented on the screen, in such a way that
the distances between keyframes are preserved as well as possible. The user
interface does not play the shots as clips since too much time would be spent
on viewing the video clips.

When the user has selected a set of suitable images, the user can perform 
a ranking through query-by-example using the color histograms with Euclidean 
distance. The closest matches within the filtered set of 2,000 shots are computed, 
where the system alternates between the different examples selected. The result 
is a ranked list of 1,000 keyframes. 
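
The ranking step can be illustrated with a short sketch. The text does not specify how the system alternates between the selected examples, so the round-robin merge below is an assumption, as are the function names `euclidean` and `rank_by_examples`, the shot ids and the two-bin histograms.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length histograms."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rank_by_examples(examples, candidates, limit):
    """Round-robin merge of per-example nearest-neighbour lists.

    examples:   histograms of the keyframes selected by the user
    candidates: dict mapping shot id -> histogram (the filtered set)
    limit:      maximum length of the ranked list
    """
    # One list of shot ids per example, closest first.
    per_example = [
        sorted(candidates, key=lambda sid: euclidean(candidates[sid], ex))
        for ex in examples
    ]
    ranked, seen = [], set()
    for position in range(len(candidates)):
        for neighbours in per_example:      # alternate between the examples
            sid = neighbours[position]
            if sid not in seen:
                seen.add(sid)
                ranked.append(sid)
                if len(ranked) == limit:
                    return ranked
    return ranked

# Made-up filtered set with two-bin histograms.
candidates = {
    "a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9],
}
ranked = rank_by_examples([[1.0, 0.0], [0.0, 1.0]], candidates, limit=3)
```

With these toy values the best match of the first example comes first, then the best match of the second example, then the runner-up of the first example.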

4 Methods 

We observed search behavior of students using the video-retrieval system de- 
scribed in Sect. 3. The study was done as an addition to a submission to 
TRECVID. Apart from the search results that were collected and submitted 
to TRECVID, additional user-related variables were collected. 

For TRECVID, 24 topics had to be searched for in a dataset consisting of 60
hours of video from ABC, CNN and C-SPAN. Twenty-one groups of students (18 pairs
and 3 individuals) were each asked to search for 12 topics. The topics were
divided into two sets of 12 (topics 1-12 and topics 13-24), and one set was
assigned to each group. For submissions to TRECVID the time to complete one
topic was limited to 15 minutes. Prior to the study the students received three
hours of training on the system. Five types of data were recorded:

Entry Questionnaire. Prior to the study all participants filled in a question- 
naire in which data was acquired about the subject pool: gender, age, subject 
of study, year of enrollment, experience with searching. 

Average Precision. Average precision (AP) was used as the measure of qual- 
ity of the results of a search. AP is the average of the precision value obtained 
after each relevant camera shot is encountered in the ranked list [1]. Note 
that AP is a quality measure for one search and not the mean quality of 
a group of searches. AP of each search was computed with a ground truth 
provided by TRECVID. Since average precision fluctuates during the search, 
we recorded not only the average precision at the end of the search but also 
the maximum average precision during the search. 
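
The definition of AP given above can be made concrete with a small sketch. The function name `average_precision` and the shot ids are hypothetical, and dividing by the total number of relevant shots in the ground truth (so that relevant shots missing from the list contribute zero) follows the usual TREC convention, which is assumed rather than stated in the text.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant shot.

    Relevant shots that never appear in the ranked list contribute zero
    (assumed TREC-style normalization by the number of relevant shots).
    """
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, shot_id in enumerate(ranked_ids, start=1):
        if shot_id in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this rank
    return precision_sum / len(relevant)

# Made-up ranked list and ground truth.
ap = average_precision(["s1", "s4", "s2", "s9"], {"s1", "s2", "s3"})
```

Here the relevant shots are found at ranks 1 and 3, giving precisions 1/1 and 2/3; the third relevant shot is never retrieved, so AP = (1 + 2/3) / 3.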

Logfiles. Records of user actions on the system were made containing the fol- 
lowing data about each search: textual queries, high-level features used, type 
of query (‘and’ or ‘or’), number of images selected, duration of the search. 
These data were collected at two points in time: at the end of the search and 
at the point at which maximum average precision was reached. The logfile 
data are used to answer the first research question. 
Topic Questionnaire. After each search the participants answered 5 questions
about the search: 1. Are you familiar with this topic? 2. Was it easy to get 
started on this search? 3. Was it easy to do the search on this topic? 4. Are 
you satisfied with your search results? 5. Do you expect that the results of 
this search contain a lot of non-relevant items (low precision)? All questions 
were answered on a 5-point scale (1 = not at all, 5 = extremely). The resulting
data were used as input for answering the second research question. 

Exit Questionnaire. After the study all participants filled in a short question- 
naire containing questions about the user’s opinion of the system and the 
similarity between this type of search and the searches that they were used 
to perform. 

To answer the third research question, the topics were categorized using a 
framework that was designed for a previous study [2]. The framework combines
different methods to categorize image descriptions (e.g., [3] and [4]) and
divides queries into various levels and classes. For the present study we used 
only those distinctions that we considered relevant to the list of topics provided 
by TRECVID (Table 1): “general” vs. “specific” and “static” vs. “dynamic”. 
Other distinctions, such as “object” vs. “scene”, were not appropriate for the
topic list since most topics contained descriptions of both objects and scenes.



Table 1. Summary of topics, categorized into general and specific and into dynamic 
and static. See http://www.cs.vu.nl/~laurali/trec/topics.html for topic details. 



Class    General                           Specific
Static   18: a crowd in urban environment  09: the mercedes logo
         16: road with vehicles            25: the white house
         14: snow-covered mountains        07: tomb of the unknown soldier
         13: flames                        17: the sphinx
         01: aerial view of buildings      24: Pope John Paul II
         10: tank                          04: Yasser Arafat
         22: cup of coffee                 20: Morgan Freeman
         23: cats                          15: Osama bin Laden
         06: helicopter                    19: Mark Souder
Dynamic  05: airplane taking off           02: basketball passing down a hoop
         12: locomotive approaching you    03: view from behind catcher while
         08: rocket taking off                 pitcher is throwing the ball
         11: person diving into water



5 Subjects 

The subjects participating in the study were 39 students in Information Science 
who enrolled in the course Multimedia Retrieval at the University of Amsterdam. 
The number of years of enrollment at the university was between 1 and 8 (mean 
= 3.5). Two were female, 37 male. Ages were between 20 and 40 (mean=23.4). 
Before the start of this study, we tested the prior search experience of the
subjects in a questionnaire. All subjects answered questions about frequency 
of use and experience with information retrieval systems in general and, more 
specifically, with multimedia retrieval systems. It appeared that all students 
searched for information at least once a week and 92 % had been searching for 
two years or more. All students searched for multimedia at least once a year, and 
65 % did this once a week or more. 88 % of the students had been searching for 
multimedia for at least two years. This was tested to make sure that prior search 
experience would not interfere with the effect of search strategies on the results. 
We did not find any evidence of a correlation between prior search experience 
and strategy, nor between prior search experience and search results. The lack 
of influence of search experience can in part be explained from the fact that the 
system was different from search systems that the students were used to. All but 
three students indicated in the exit questionnaire that the system was not at all 
similar to what they were used to. All students disagreed with or were neutral 
to the statement that the topics were similar to topics they typically search for. 
Another possible reason for the absence of an effect of prior search experience is 
the three-hour training that all students had received before the study. 

The subjects indicated a high familiarity with the topics. Spearman’s cor- 
relation test indicated a relationship between familiarity and average precision 
only within topics 10 and 13. We do not consider this enough evidence that there 
is in fact a relationship. 



6 Results 

The data were analyzed on the level of individual searches. A search is the process 
of one student pair going through the three interactive stages of the system for 
one topic. Twenty-one groups of students searched for 12 topics each, resulting
in 252 searches. After exclusion of searches that were not finished, contained
too much missing data, or exceeded the maximum of 15 minutes imposed by TRECVID,
242 searches remained.



User actions. Table 2 presents descriptive statistics for the variables recorded
in the logfiles. It shows that a search took approximately 8 minutes; 9 images
were selected per search; high-level features were hardly used; and or-search
was used more often than and-search.

The mean average precision at the end of a search was 0.16. Number of 
selected images was the most important variable to explain the result of a search. 
This can be explained by the fact that each correctly selected image adds at least 
one relevant image to the result set. The contribution of the ranking to the result 
was almost negligibly small; change in AP caused by the ranking step had a mean 
of 0.001 and a standard deviation of 0.032. Number of selected images was not 
correlated to time to finish topic, number of features, or type of search. 

Table 2. User actions in the system at the moment of maximum AP and at the end
of the search

                                   Max                        End
                              N    Min.  Max.  Mean  St.D.    Min.  Max.  Mean  St.D.
Time (sec.)                 242       0   852   345    195       6   899   477    203
No. of images selected      242       0    30  8.47   7.01       0    30  9.07   7.06
No. of high-level features  240       0     5  0.50   0.84       0    17  0.59   1.39
‘And’ or ‘Or’ search        240    And: 75  Or: 165           And: 82  Or: 158

There was no correlation between time to finish topic and average precision,
nor between type of search and average precision. The number of high-level
features had a negative influence on the result. This is depicted in Fig. 1.
The number of uses per feature was too low to draw conclusions about the value
of each feature. We can conclude, however, that selection of more than one
feature leads to low average precision. To give an indication of the quality of
the features that were used by the students, Table 3 shows the frequency of use
and the mean average precision of the features. Only searches in which a single
feature was used are included.




Fig. 1. Scatterplot of the number of selected features (horizontal axis: number
of features selected at end) and AP at the end of the search. One case with
17 features and an AP of 0.027 is left out of the plot.



Table 3. High-level features: mean average precision and standard deviation.

Feature                N   Mean AP  St.d.    Feature            N   Mean AP  St.d.
Aircraft               5   0.09     0.05     People             3   0.13     0.15
Animal                 5   0.17     0.06     PersonX            7   0.14     0.16
Building               2   0.30     0.00     PhysicalViolence   0
CarTruckBus            4   0.11     0.03     Road               3   0.06     0.04
FemaleSpeech           0                     SportingEvent      9   0.08     0.03
NewsSubjectFace        1   0.24              Vegetation         1   0.13
NewsSubjectMonologue   1   0.70              WeatherNews        0
NonStudioSetting       4   0.15     0.13     ZoomIn             1   0.08
Outdoors              15   0.17     0.20

User prediction of search quality. In the topic questionnaire we collected
opinions and expectations of users on a particular search. All questions
measure an aspect of the user’s estimation of the search. For each question it
holds that a high score represents a positive estimation, while a low score
represents a negative estimation. Mutual dependencies between the questions
complicate conclusions on the correlation between each question and the
measured average precision of a search. Therefore, we combined the scores on
the 4 questions listed in Table 4 into one variable, using principal component
analysis. The new variable that is thus created represents the combined user
estimation of a search. This variable explains 70% of the variance between the
cases. Table 4 shows the loading of each question on the first principal
component. Pearson’s correlation test showed a relationship between combined
user estimation and actually measured average precision (Pearson’s correlation
coefficient (Pcc) = 0.298, α = 0.01). This suggests that users are indeed able
to estimate the success of their search.
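
The analysis described above (combining questionnaire scores into a first principal component and correlating it with AP) can be sketched as follows. Everything in this block is an illustrative assumption: the answers and AP values are made up, the helper names are hypothetical, and the component is computed by power iteration on the covariance matrix, whereas the statistics package used in the study may have worked on the correlation matrix instead.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def first_component_scores(rows, iterations=200):
    """Project each row onto the first principal component (power iteration)."""
    n_cols = len(rows[0])
    col_means = [mean([r[j] for r in rows]) for j in range(n_cols)]
    centered = [[r[j] - col_means[j] for j in range(n_cols)] for r in rows]
    # Sample covariance matrix of the centered questionnaire scores.
    cov = [[sum(r[i] * r[j] for r in centered) / (len(rows) - 1)
            for j in range(n_cols)] for i in range(n_cols)]
    vec = [1.0] * n_cols
    for _ in range(iterations):
        # Repeated multiplication converges to the dominant eigenvector.
        vec = [sum(cov[i][j] * vec[j] for j in range(n_cols))
               for i in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return [sum(c * v for c, v in zip(row, vec)) for row in centered]

# Made-up answers of five searches to four 5-point questions, with made-up AP.
answers = [[1, 1, 2, 1], [2, 2, 2, 3], [3, 3, 3, 3], [4, 4, 5, 4], [5, 5, 4, 5]]
ap_values = [0.02, 0.10, 0.12, 0.30, 0.25]

combined_estimation = first_component_scores(answers)
r = pearson(combined_estimation, ap_values)
```

The toy data are deliberately strongly correlated; the study itself reports a much weaker relationship (Pcc = 0.298).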



Table 4. Principal Component Analysis 



Questionnaire item       Component 1
easy to start search     0.869
easy to do search        0.909
satisfied with search    0.874
expect high precision    0.678


Another measure of user estimation of a search is the difference between 
the point where maximum precision was reached and the point where the user 
stopped searching. As shown in Table 2, the mean time to finish a search
was 477 seconds, while the mean time to reach maximum average precision 
was 345 seconds. The mean difference between the two points in time was 128 
seconds, with a minimum of 0, a maximum of 704 and a standard deviation of 
142 seconds. This means that students typically continued their search for more 
than two minutes after the optimal result was achieved. This suggests that even 
though students were able to estimate the overall success of a search, they did 
not know when the best results were achieved within a search. A correlation 
between combined user estimation and time-after-maximum-result shows that 
the extra time was largest in searches that got a low estimation (Pcc = -0.426,
α = 0.01). The extra 2 minutes did not do much damage to the precision. The
mean average precision of the end result of a search was 0.16, while the mean 
maximum average precision of a search was 0.18. The mean difference between 
the two was 0.017, with a minimum of 0, a maximum of 0.48 and a standard 
deviation of 0.043. 
Topic type. Table 5 shows that “specific” topics were better retrieved than
“general” topics. The results of “static” topics were better than the results of
“dynamic” topics. These differences were tested with an analysis of variance.
The differences are significant far beyond the 0.01 α-level. We did not find any
evidence that user actions were different in different categories.



Table 5. Mean AP of topic types, and ANOVA results

Mean AP    Static  Dynamic  Total
General      0.12     0.10   0.11
Specific     0.27     0.08   0.22
Total        0.19     0.10   0.16

ANOVA results      SS   df     MS       F   Sig.
Between groups  0.426    1  0.426  18.109  0.000
Within groups   5.648  240  0.024
Total           6.074  241
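
As a consistency check on the ANOVA figures in Table 5, the F ratio can be recomputed from the tabulated sums of squares and degrees of freedom. This is a minimal sketch; the function name `f_statistic` is hypothetical, and only the four input numbers come from the table.

```python
def f_statistic(ss_between, df_between, ss_within, df_within):
    """F ratio of a one-way ANOVA from sums of squares and degrees of freedom."""
    ms_between = ss_between / df_between   # mean square between groups
    ms_within = ss_within / df_within      # mean square within groups
    return ms_between / ms_within

# Values copied from Table 5.
f_value = f_statistic(0.426, 1, 5.648, 240)
```

The recomputed ratio is about 18.10, which agrees with the reported F of 18.109 up to the rounding of the tabulated sums of squares.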









The change in AP caused by the ranking step was positive for general topics 
(mean = 0.005), while negative for specific topics (mean = -0.004). For general
topics we found a correlation between change in AP and AP at the end of the
search (Pcc = 0.265, α = 0.004), which was absent for specific topics.

7 Discussion 

Different types of topics result in differences in the quality of the search results. 
Results of “specific” topics were better than results of “general” topics. This 
suggests that indexing and filtering are the most important steps in the process. 
These steps are based on text retrieval, where it is relatively easy to find uniquely 
named objects, events or people. In content-based image retrieval, on the other 
hand, and especially when the image is considered as a whole, it is difficult to 
distinguish unique objects or people from other items of the same category. We 
are planning to upgrade the system so that regions within an image can be dealt 
with separately. Results of “static” topics were better than results of “dynamic” 
topics. This can be explained by the fact that the system treats the video data 
in terms of keyframes, i.e., still images. 

Of the recorded user actions, the number of selected images is by far the most 
important for the result. This is mainly caused by the addition of correctly 
selected images to the result set. The contribution of the ranking step to the 
average precision was almost negligibly small. We conclude from this that the 
main contribution of content-based image retrieval to the retrieval process is 
visualization of the dataset which gives the user the opportunity to manually 
select relevant keyframes. The visualization of the data set also gives the user 
an overview of the data and thus an indication of the success of the search. The 
results of the study show that users can estimate the success of a search quite 
well, but do not know when the optimal result is reached within a search. 

This study reflects user behavior on one particular system. However, the 
results can to a certain extent be generalized to other interactive video-retrieval 
systems. The finding that “specific” topics are better retrieved than “general” 
topics is reflected by the average TRECVID results. The fact that users do not 
know when to stop searching is a general problem of category search [5], where 
a user is searching for shots belonging to a certain category rather than for one 
specific shot. One solution to this problem is providing the user with an overview 
of the dataset. Future research is needed to compare the effectiveness of different 
types of visualization. 

One of the reasons for this study was to learn which user variables are of 
importance for video retrieval, so that these variables can be measured in a 
future experiment. The most discriminating variable in the study proved to be 
the number of selected images. Further research is needed in which the optimal 
number of examples in a query-by-example is determined, taking into account the 
time spent by a user. In addition, future research is needed in which the four steps 
in the system are compared. In an experimental setting text-based retrieval and 
content-based retrieval can be compared. It would also be interesting to compare 
the results of an interactive video retrieval system to sequential scanning of shots 
in the data set for a fixed amount of time. 

One of the results was that prior experience with searching and familiarity 
with the topic do not affect the quality of the search results. The latter seems to 
indicate that background knowledge of the searcher about the topic is not used 
in the search process. Some attempts to include background knowledge in the 
process of multimedia retrieval have been made (see for example [6,7]). We would be 
interested to see how these techniques can be incorporated in an interactive video 
retrieval system. 



Abstract. With information systems, the real design problem is not increased 
access to information, but greater efficiency in finding useful information. In 
our approach to video content browsing, we try to match the browsing 
environment with human information processing structures by applying ideas 
from information foraging theory. In our prototype, video content is divided 
into video patches, which are collections of video fragments sharing a certain 
attribute. Browsing within a patch increases efficient interaction as other video 
content can be (temporarily) ignored. Links to other patches (“browsing cues”) 
are constantly provided, facilitating users to switch to other patches or to 
combine patches. When a browsing cue matches a user’s goals or interests, this 
cue carries a “scent” for that user. It is stated that people browse video material 
by following scent. The prototype is now sufficiently developed for subsequent 
research on this and other principles of information foraging theory. 



1 Introduction 

Humans are informavores: organisms that hunger for information about the world and 
about themselves [1]. The current trend is that more information is made more easily 
available to more people. However, a wealth of information creates a poverty of 
directed attention and a need to allocate sought-for information efficiently (Herbert 
Simon in [2]). The real design problem is not increased access to information, but 
greater efficiency in finding useful information. An important design objective should 
be the maximisation of the allocation of human attention to information that will be 
useful. At a 1997 CHI workshop on Navigation in Electronic Worlds, it was stated 
[3]: “Navigation is a situated task that frequently and rapidly alternates between 
discovery and plan-based problem-solving. As such, it is important to understand each 
of the components of the task - the navigator, the world that is navigated, and the 
content of that world, but equally important to understand the synergies between 
them.” Information foraging theory [4] can be an important provider of knowledge in 
this regard, as it describes the information environment and how people purposefully 
interact with that environment. 

In this paper, we describe the design of a video-interaction environment based upon 
ideas from information foraging theory. The environment is developed to test these 
ideas in subsequent research, as will be explained at the end of this paper. 

Finding video content for user-defined purposes is not an easy task: video is 
time-based, making interacting with video cumbersome and time-consuming. There is an 
urgent need to support the process of efficiently browsing video content. An orderly 
overview of existing video browsing applications and related issues can be found in 
[5]. Our approach adds a new perspective in that it applies a human-computer 
interaction theory to the problem of video content browsing. 

Emphasis is on browsing - and not on querying - for a number of reasons. To start 
with, people are visual virtuosos [6]. In visual searching, humans are very good at 
rapidly finding patterns, recognising objects, generalising or inferring information 
from limited data, and making relevance decisions. The human visual system can 
process images more quickly than text. For instance, searching for a picture of a 
particular object is faster than searching for the name of that object among other 
words [7]. Given these visual abilities, for media with a strong visual component, 
users should be able to get quick access to the images. In the case of video, its 
richness and time-basedness can obstruct fast interaction with the images, so efficient 
filter and presentation techniques are required to get access to the images. 

Except when the information need is well defined and easily articulated in a 
(keyword) query, browsing is an advantageous searching strategy because in many 
cases users do not know exactly what they are looking for. Well-defined search 
criteria often crystallise only in the process of browsing, or initial criteria are altered 
as new information becomes available. A great deal of information and context is 
obtained along the browsing path itself, not just at the final page. The search process 
itself is often as important as the results. Moreover, users can have difficulty with 
articulating their needs verbally, which especially applies in a multimedia 
environment, where certain criteria do not lend themselves well to keyword search. 
Furthermore, appropriate keywords for querying may not be available in the 
information source, and if they are available, the exact terminology can be unknown 
to the user [8]. 

Browsing is a search strategy more closely related to “natural” human behaviour than 
querying [9]. As such, a theory describing natural behaviour in an information 
environment may be very useful when designing a video browsing environment. 
Information foraging theory addresses this topic. 



2 Information Foraging Theory (IFT) 

In this paper, we describe the design of a video browsing environment based upon 
ideas from information foraging theory [4]. IFT is a “human-information interaction” 
theory stating that people will try to interact with information in ways that maximise 
the gain of valuable information per unit cost. Core elements of the theory that we 
apply are: 

— People forage through an information space in search of a piece of information that 
associates with their goals or interests, much as animals forage for food. 

— For the user, the information environment has a “patchy” structure (compare 
patches of berries on berry bushes). 

— Within a patch, a person can decide to forage the patch or switch to another patch. 

— A strategy will be superior to another if it yields more useful information per unit 
cost (with cost typically measured in time and effort). 

— Users make navigational decisions guided by scent, which is a function of the 
perception of value, cost, and access path of the information with respect to the 
goal and interest of the user. 

— People adapt their scent-following strategies to the flux of information in the 
environment. 

To apply IFT ideas to building an environment that supports efficient video 
browsing we need patches, and scent-providing links (browsing cues) to those 
patches. The patches provide structure to the information environment. Patches are 
expected to be most relevant when patches as defined in the database match with 
patches as the user would define them. To facilitate the use of patches, users need to 
be helped to make estimates of the gain they can expect from a specific information 
patch, and how much it will cost to discover and consume that information. These 
estimates are based on the user’s experience, but also on the information provision of 
browsing cues. 

The concept of scent provides directions to the design of information systems as it 
drives users’ information-seeking behaviour. When people perceive no scent, they 
should be able to perform a random walk in order to spot a trace of scent. When 
people perceive a lot of scent, they should be able to follow the trail to the target. 
When the scent gets low, people should be able to switch to other patches. These 
types of browsing behaviours all need to be supported in the design. Typical design- 
related situations can be that browsing cues are misleading (the scent is high, but the 
target is not relevant/interesting) or badly presented (no or low scent, but a very 
relevant or interesting target). 
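
The browsing behaviours described above can be summarised as a simple decision rule. The following Python sketch is only an illustration: the numeric scent scale in [0, 1], the `switch_threshold` value and the names `Action` and `choose_action` are assumptions introduced here, not quantities defined by IFT or by the prototype.

```python
from enum import Enum

class Action(Enum):
    RANDOM_WALK = "random walk"
    SWITCH_PATCH = "switch to another patch"
    FOLLOW_TRAIL = "follow the trail"

def choose_action(scent: float, switch_threshold: float = 0.3) -> Action:
    """Map perceived scent (assumed to lie in [0, 1]) to a browsing behaviour."""
    if scent <= 0.0:
        return Action.RANDOM_WALK      # no scent: explore to pick up a trace
    if scent < switch_threshold:
        return Action.SWITCH_PATCH     # scent is low: try another patch
    return Action.FOLLOW_TRAIL         # enough scent: keep following it

for s in (0.0, 0.1, 0.8):
    print(s, "->", choose_action(s).value)
```

An interface that supports all three branches lets the user act on whatever amount of scent is perceived at a given moment.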



2.1 Video Patches 

A video can be looked at as a database containing individual video fragments [10]. 
The original narrative of the video is “only” one out of many ways of organising and 
relating the individual items. People often want to structure the information 
environment in their own way, where the “decodings are likely to be different from 
the encoder’s intended meaning” [11]. 

Video patches are collections of video fragments sharing a certain attribute (see 
Figure 1). Attributes may vary along many dimensions, including complex human 
concepts and low-level visual features. Patches can form a hierarchy, and several 
attributes can be combined in a single patch. 

As video fragments can have a number of attributes (e.g., a fragment can contain 
certain people, certain locations, certain events etc.), the fragments will occur in a 
number of patches. When viewing a fragment in a patch, links to other patches can be 
presented to the user. As such, patches form a hyperlinked network. 
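
The way shared fragments turn patches into a hyperlinked network can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the fragment ids, the attribute labels and the helper names `build_patches` and `build_links` are hypothetical and do not come from the prototype.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fragments: id -> set of attributes (labels are invented).
fragments = {
    "f1": {"squares", "circles"},
    "f2": {"circles"},
    "f3": {"circles", "triangles"},
    "f4": {"triangles"},
}

def build_patches(fragments):
    """Group fragment ids by attribute: one patch per attribute."""
    patches = defaultdict(set)
    for frag_id, attributes in fragments.items():
        for attribute in attributes:
            patches[attribute].add(frag_id)
    return dict(patches)

def build_links(patches):
    """Link two patches whenever they share at least one fragment."""
    links = {}
    for a, b in combinations(sorted(patches), 2):
        shared = patches[a] & patches[b]
        if shared:
            links[(a, b)] = shared
    return links

patches = build_patches(fragments)
links = build_links(patches)
print(patches)
print(links)
```

In this toy data the "circles" patch is linked to both other patches, while "squares" and "triangles" share no fragment and therefore have no direct link.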

Patches provide easy mechanisms to filter video content as users can browse a 
patch and ignore video fragments not belonging to the patch (see also Figure 3). Which 
patches are good depends on factors such as the user's task and the video genre. User 
experiments are needed to establish which patches are useful in which situation. 



Fig. 1. Representation of video patches. The lower container is a database with video 
fragments (visualised as filmstrips), which can be the semantic units of a video. Fragments 
with the same attributes (here: squares, circles, triangles) are combined in video patches. The 
upper container is the browsing environment containing video patches (here: dark ellipses). 
When a fragment appears in two or more patches, links between these patches emerge (here: 
arrows between patches). 



2.2 The Scent of a Video Patch 

Certain video patches (or items within those patches) can have a semantic match with 
the user’s goal or interests, and as such, give off scent. A user’s goal or interest 
activates a set of chunks in a user’s memory, and the perceived browsing cues leading 
to patches (or patch items) activate another set of chunks. When there is a match, 
these browsing cues will give off scent. Users can find the relevant patches by 
following the scent. Scent is wafted backward along hyperlinks - the reverse direction 
from browsing (see Figure 2). Scent can (or should) be perceivable in the links - or 
browsing cues - towards those targets. Users act on the browsing cues that they 
perceive as being most semantically similar to the content of their current goal or 
interest (see also [12]). For example, the scent of menus influences the decision 
whether users will use them or start a query instead [13]. 
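
One crude way to make the notion of semantic similarity between a goal and a browsing cue concrete is term overlap. The Python sketch below is an assumption made purely for illustration (Jaccard overlap between word sets); it is not how IFT or the prototype determines scent, and the cue labels and terms are invented.

```python
def scent(goal_terms, cue_terms):
    """Jaccard overlap between goal and cue word sets, in [0, 1]."""
    goal, cue = set(goal_terms), set(cue_terms)
    if not goal or not cue:
        return 0.0
    return len(goal & cue) / len(goal | cue)

def strongest_cue(goal_terms, cues):
    """Return the cue label with the highest scent for this goal."""
    return max(cues, key=lambda label: scent(goal_terms, cues[label]))

# Hypothetical browsing cues, each described by a few terms.
cues = {
    "politicians": {"politics", "minister", "parliament"},
    "drugs": {"drugs", "policy", "health"},
    "sports": {"football", "goal", "match"},
}
goal = {"drugs", "policy", "minister"}
print(strongest_cue(goal, cues))
```

A cue that shares no terms with the goal gets a score of zero, which corresponds to the situation in which no scent is perceived.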

In terms of IFT, the perception of information scent (via browsing cues) informs 
the decisions about which items to pursue so as to maximise the information diet of 
the forager. If the scent is sufficiently strong, the forager will be able to make the 
informed choice at each decision point. People switch when information scent gets 
low. If there is no scent, the forager will perform a random walk. What we are 
interested in is what exactly determines the amount of scent that is perceived. 

We have built an experimental video-browsing application (Figure 3) to study a) 
which attributes are good for creating patches, and b) how to present browsing cues 
in such a way that they convey scent to the user. 



Fig. 2. Schematic illustration of the scent of a video patch. One or more video fragments in the 
central patch semantically match the user’s goals or interests, and therefore can be 
considered a target. The browsing cues in surrounding patches that link to the target carry 
“scent”, which is wafted backwards along the hyperlinks. 



3 Design of a Video-Browsing Application 

We designed a prototype of a video browsing tool applying the ideas from IFT (see 
Figure 3). The aim of the prototype is to validate the applicability of IFT for browser 
design. For that purpose, the video that is browsed needs to be prepared in advance as 
a set of patches and scent-carrying cues. The richness of semantic cues that have to be 
taken into account goes beyond the current state of the art in research about 
automated feature extraction, structure analysis, abstraction, and indexing of video 
content (see for instance [14], [15], [16], [17]). The automated processes that are 
technologically feasible cannot yet accomplish the video browsing structure that is 
cognitively desirable. For the experimental prototype we prepared the video mostly 
by hand. 

In the prototype, scent-based video browsing will not be independent of querying. 
Smeaton [18] states that for video we need a browse-query-browse interaction. 
Querying helps to arrive at the level of video patches where the user needs to switch 
to browsing. Query-initiated browsing [19] is demonstrated to be very fruitful [20]. 
Golovchinsky [21] found that in a hypertext environment, (dynamic) query-mediated 
links are as effective as explicit queries, indicating that the difference between queries 
and links (as can be found - for example - in a menu) is not always significant. In the 
application described here, we see that the user can query the database and actually is 
browsing prefabricated queries. 



Fig. 3. Interface of the patch-based video browsing application. Users can start interacting 
with the video by a) selecting a patch from the patch menu, b) querying video fragment 
information after which results are presented as a patch, or c) simply playing the video. When 
a patch is selected, the patch fragments - represented by boxes - are displayed on the video 
timeline, thus displaying frequency, duration, and distribution information of the fragments in 
the patch. To get more focussed information, a part of the patch is zoomed in on, and for 
these items keyframes are presented. One fragment in the patch is always “active”. For this 
item, the top-left part of the interface presents 9 frames (when the item is not played or 
scrolled using the slidebar). All available information about the fragment is displayed 
(transcript, text on screen, descriptions). For the activated fragment it is shown in which other 
patches it appears by highlighting those patches in the patch menu. 



3.1 Patch Creation and Presentation 

When browsing starts and the user wants to select a first patch from the menu, the 
patch with the highest scent will probably get selected (“this is where I will probably 
find something relevant/of interest”). Which types of patches are created and the way 
they are presented (labelled) is here the central issue. 

In order to create patches, the video is first segmented into semantically 
meaningful fragments, which will become the patch items in the video patches. Which 
units are relevant is genre-dependent. For a newscast, the news items seem to be the 
best units. For an interview, it may be each new question or group of related 
questions. For a football game, it may be the time between pre-defined “major events” 
(e.g. goals, cards, etc.). 

For each video unit, attributes are defined (in other words, what is in this fragment? 
What is this fragment about?). This step defines the type (or theme) of the video 
patches, and as such, the main entries in the patch menu that will be formed. User 
inquiries are needed to find out which attributes are most useful. For a football 
match, it may be players, goals, shots on goal, cards etc. For a newscast, this may be 
the news category, persons, locations, events, etc. 

Next, attribute values need to be defined (so, not just whether there is a person, but 
also who is that person). This step defines the specific patches that will be used in the 
browsing process. The main question here is which values will be most useful, which 
will depend on contexts of use. This is a typical example of why much is done by hand: 
automatic feature extraction techniques currently have great difficulties performing 
this task. Data from closed captions, speech recognition, and OCR (optical character 
recognition), however, proved to be most helpful. 

Fragments sharing the same attribute values are combined into patches, and these 
patches are indexed. All this information about the video is stored in MPEG-7 [22]. 
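
The steps above (fragments, attributes, attribute values, patches) can be sketched as a two-level index from which a patch menu could be generated. This Python sketch rests on assumptions: the annotations are invented, the function name `build_patch_index` is hypothetical, and a plain dictionary stands in for the MPEG-7 description used in the prototype.

```python
from collections import defaultdict

# Hypothetical hand-made annotations for three news items.
annotations = {
    "item1": {"person": ["politician A"], "category": ["politics"]},
    "item2": {"person": ["politician A"], "category": ["health"]},
    "item3": {"location": ["Dublin"], "category": ["health"]},
}

def build_patch_index(annotations):
    """Return {attribute: {value: [fragment ids]}} for the patch menu."""
    index = defaultdict(lambda: defaultdict(list))
    for frag_id in sorted(annotations):
        for attribute, values in annotations[frag_id].items():
            for value in values:
                index[attribute][value].append(frag_id)
    return {attr: dict(vals) for attr, vals in index.items()}

index = build_patch_index(annotations)
for attribute, values in index.items():
    print(attribute, values)
```

The outer keys correspond to the main entries of the patch menu and the inner keys to the specific patches that a user can select.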



3.2 Support for Different Types of Browsing Behaviour 

As noted earlier, we distinguish three types of browsing behaviour: a random walk, 
within-patch browsing, and switching (between-patches browsing). 

For a random walk, users can play or browse the video as a whole, or browse the 
patch containing all video fragments (thus getting information for each fragment). 

For within-patch browsing, users can simply play the patch “hands-free”: at the 
end of each fragment the video jumps to the start of the next fragment in the patch. 
Alternatively, users can move the zooming window (using a slidebar) to scan the 
thumbnails representing the fragments in the patch. By clicking a thumbnail (or the 
neutral boxes representing a fragment), the user “activates” a fragment, thus 
displaying a lot of detailed information about that fragment. Users can use “next” and 
“previous” buttons to easily see detailed information about other individual 
fragments. As such, users can quickly scan within a patch what is or is not relevant 
(that is, what does and does not associate with their goals or interests). 

For every patch item, it is shown in which other patches it appears. Highlighting 
those patches in the patch menu indicates this. This also provides metadata about the 
current item. For example, when watching an item from the patch “politicians”, the 
item “drugs” may be highlighted in the menu, indicating what issue a politician is 
referring to. When the user is interested in video fragments on drugs, the user can 
simply switch to that patch. When the user is interested in opinions of politicians 
about drugs, the user may combine the two patches using the logical operator AND 
(or OR, if the user is interested in both). This way, users can influence the structure of 
the information environment in a way IFT calls “enrichment” [4]. Of course, users 
can always switch to any other patch in the patch menu. 
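
Combining patches with logical operators amounts to set operations on the fragments they contain. The following Python sketch illustrates this under assumptions: the patch contents are invented and the helper `combine` is hypothetical, not a function of the prototype.

```python
# Hypothetical patches: patch name -> set of fragment ids.
patches = {
    "politicians": {"f1", "f2", "f5"},
    "drugs": {"f2", "f3", "f5", "f7"},
}

def combine(patches, first, second, operator):
    """Combine two patches with AND (intersection) or OR (union)."""
    a, b = patches[first], patches[second]
    if operator == "AND":
        return a & b
    if operator == "OR":
        return a | b
    raise ValueError(f"unsupported operator: {operator}")

both = combine(patches, "politicians", "drugs", "AND")
either = combine(patches, "politicians", "drugs", "OR")
print(sorted(both))
print(sorted(either))
```

The AND result keeps only fragments in which politicians and drugs co-occur, whereas OR yields a larger patch covering both themes.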

When the user wants to switch, the interface needs to present possibilities (that is, 
browsing cues) to switch to other patches. People may want to switch 
when: a) the current source has a scent below a certain threshold, or b) the user is 
inspired by what is seen within the current source (“I want to see more like this!”), or 
c) the browsing cues simply give off a lot of scent (cues point directly to sought-for 
information). If necessary, browsing cues can be presented dynamically, that is, 
query- or profile-dependent (see also [19]). 



3.3 Scent Presented in the Interface 

When the user is browsing video data, the current patch or fragment will give off a 
certain scent via so-called “scent carriers”. Assuming that people are scent-followers, 
the main question here is: How can we present scent in such a way that users can 
efficiently and effectively browse video material? 

The title of the patch is the first scent carrier the user is confronted with. When the 
patch name semantically matches the user’s goals or interest, it will carry a certain 
amount of scent. The way patch names are organised and presented will influence the 
amount of perceived scent. 

When a patch is activated, several indicators will provide more or less scent, 
indicating to the user that “I’m on the right track”, or “I need to switch to another 
patch”. First of all, the frequency, duration, and distribution information of the 
fragments in the patch can provide a lot of scent (for example, when looking for the 
main people involved in a story, frequency information may be very useful). Still 
frames representing the fragments in the patch are the next scent carrying cues. For 
the one active fragment, a lot of scent carriers are available: keyframes (in this case, 
nine), transcript (derived from closed captions [when available] and/or speech 
recognition), displayed text in the video (optical character recognition), and added 
description. Of course, the video images - that can be either viewed or browsed by 
fast-forwarding or using a slider - can also carry scent by themselves. 

Regarding switching to other patches, the way indications about overlapping 
patches are presented will influence the scent people perceive. 



4 Conclusions and Future Work 

The practical problem we try to deal with is how people can interact with video 
content in such a way that they can efficiently pursue their goals. We translate this to 
the research problem of how we can match the information environment with human 
information processing structures. Information foraging theory is assumed to be a 
fitting theory to answer this question as it both describes how people perceive and 
structure the information environment, and how people navigate through this 
environment. Our prototypical browsing environment is based on the principles of 
this theory. The hypotheses we derive from the theory are that humans perceive the 
information environment as “patchy”, humans navigate through the information 
environment by following scent, and humans will interact with the environment in 
ways that maximise the gain of valuable information per unit cost. We applied these 
ideas to construct a solution for efficient interaction with video content, as described 
in this paper. The actual development took place in a few steps applying an iterative 
design approach [23]. This is the starting point for our future research on the 
applicability of information foraging theory for the design of video interaction 
applications. As a first step we plan experiments to study the effectiveness of different 
presentations of browsing cues, how the perception of scent relates to different types 
of user tasks, how to support patch navigation, and which patches are useful in which 
situations. 



Abstract. In this paper, we propose the use of the Maximum Entropy 
approach for the task of automatic image annotation. Given labeled 
training data, Maximum Entropy is a statistical technique which allows 
one to predict the probability of a label given test data. The techniques 
allow for relationships between features to be effectively captured, and 
has been successfully applied to a number of language tasks including 
machine translation. In our case, we view the image annotation task as 
one where a training data set of images labeled with keywords is pro- 
vided and we need to automatically label the test images with keywords. 
To do this, we first represent the image using a language of visterms and 
then predict the probability of seeing an English word given the set of 
visterms forming the image. Maximum Entropy allows us to compute the 
probability and in addition allows for the relationships between visterms 
to be incorporated. The experimental results show that Maximum En- 
tropy outperforms one of the classical translation models that has been 
applied to this task and the Cross Media Relevance Model. Since the 
Maximum Entropy model allows for the use of a large number of predicates 
to possibly increase performance even further, the Maximum Entropy 
model is a promising model for the task of automatic image annotation. 



1 Introduction 

The importance of automatic image annotation has been increasing with the 
growth of the worldwide web. Finding relevant digital images from the web and 
other large size databases is not a trivial task because many of these images do 
not have annotations. Systems using non-textual queries like color and texture 
have been proposed but many users find it hard to represent their information 
needs using abstract image features. Many users prefer textual queries and au- 
tomatic annotation is a way of solving this problem. 

Recently, a number of researchers [2,4,7,9,12] have applied various statistical 
techniques to relate words and images. Duygulu et al. [7] proposed that the image 
annotation task can be thought of as similar to the machine translation problem 
and applied one of the classical IBM translation models [5] to this problem. 
Jeon et al. [9] showed that relevance models (first proposed for information 
retrieval and cross-lingual information retrieval [10]) could be used for image 
annotation and they reported much better results than [7]. Berger et al. [3] 
showed how Maximum Entropy could be used for the machine translation tasks 
and demonstrated that it outperformed the classical (IBM) machine translation 
models for the English-French translation task. The Maximum Entropy approach 
has also been applied successfully to a number of other language tasks [3,14]. 

Here, we apply Maximum Entropy to the same dataset used in [7,9] and show 
that it outperforms both those models. We first compute an image dictionary 
of visterms which is obtained by first partitioning each image into rectangular 
regions and then clustering image regions across the training set. Given a training 
set of images and keywords, we then define unigram predicates which pair image 
regions and labels. We automatically learn using the training set how to weight 
the different terms so that we can predict the probability of a label (word) given 
a region from a test image. To allow for relationships between regions we define 
bigram predicates. In principle this could be extended to arbitrary n-grams but 
for computational reasons we restrict ourselves to unigram and bigram predicates 
in this paper. 

Maximum Entropy maximizes entropy, i.e., it prefers a uniform distribution 
when no information is available. Additionally, the approach automatically 
weights features (predicates). The relationship between neighboring regions is 
very important in images and Maximum Entropy can account for this in a nat- 
ural way. 
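
The preference for a uniform distribution can be illustrated with the standard conditional (log-linear) form of a Maximum Entropy model. The Python sketch below is a minimal illustration restricted to unigram predicates; the function name `label_probabilities`, the visterm ids and the weights are assumptions introduced here and are not taken from the experiments.

```python
import math

def label_probabilities(visterms, labels, weights):
    """P(label | visterms) for a log-linear model with unigram predicates.

    weights maps (visterm, label) -> lambda; missing predicates weigh 0.
    """
    scores = {
        label: sum(weights.get((v, label), 0.0) for v in visterms)
        for label in labels
    }
    z = sum(math.exp(s) for s in scores.values())   # normalising constant
    return {label: math.exp(s) / z for label, s in scores.items()}

labels = ["sky", "grass", "tiger"]
visterms = ["v12", "v7", "v12"]

# No learned weights: every label is equally likely.
uniform = label_probabilities(visterms, labels, {})
print(uniform)

# Hypothetical weight favouring "tiger" when visterm v7 is present.
skewed = label_probabilities(visterms, labels, {("v7", "tiger"): 1.5})
print(skewed)
```

With an empty weight table all scores are zero and the distribution is uniform; each non-zero weight moves probability mass towards the labels whose predicates fire.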

The remainder of the paper is organized as follows. Related work is discussed 
in Section 2. Section 3 provides a brief description of the features and image 
vocabulary used, while the Maximum Entropy model and its application to image 
annotation are briefly discussed in Section 4. Experiments and results are discussed in 
Section 5, while Section 6 concludes the paper. 

2 Related Work 

In image annotation one seeks to annotate an image with its contents. Unlike 
more traditional object recognition techniques [1,8,15,17] we are not interested 
in specifying the exact position of each object in the image. Thus, in image 
annotation, one would attach the label “car” to the image without explicitly 
specifying its location in the picture. For most retrieval tasks, it is sufficient to do 
annotation. Object detection systems usually seek to find a specific foreground 
object, for example, a car or a face. This is usually done by making separate 
training and test runs for each object. During training positive and negative 
examples of the particular object in question are presented. However, in the 
annotation scheme here background objects are also important and we have 
to handle at least a few hundred different object types at the same time. The 
model presented here learns all the annotation words at the same time. Object 
recognition and image annotation are both very challenging tasks. 

Recently, a number of models have been proposed for image annotation [2, 
4,7,9,12]. Duygulu et al. [7] described images using a vocabulary of blobs. First, 







J. Jeon and R. Manmatha 



regions are created using a segmentation algorithm like normalized cuts. For 
each region, features are computed and then blobs are generated by clustering 
the image features for these regions across images. Each image is generated by 
using a certain number of these blobs. Their Translation Model applies one of the 
classical statistical machine translation models to translate from the set of blobs 
forming an image to the set of keywords of an image. Jeon et al. [9] instead 
assumed that this could be viewed as analogous to the cross-lingual retrieval 
problem and used a Cross Media Relevance Model (CMRM) to perform both 
image annotation and ranked retrieval. They showed that the performance of the 
model on the same dataset was considerably better than the models proposed 
by Duygulu et al. [7] and Mori et al. [11]. 

The above models use a discrete image vocabulary. A couple of other models 
use the actual (continuous) features computed over each image region. This 
tends to give improved results. Correlation LDA proposed by Blei and Jordan 
[4] extends the Latent Dirichlet Allocation (LDA) Model to words and images. 
This model assumes that a Dirichlet distribution can be used to generate a 
mixture of latent factors. This mixture of latent factors is then used to generate 
words and regions. Expectation-Maximization is used to estimate this model. 
Lavrenko et al. proposed the Continuous Relevance Model (CRM) to extend the 
Cross Media Relevance Model (CMRM) [9] to directly use continuous-valued 
image features. This approach avoids the clustering stage in CMRM. They 
showed that the performance of the model on the same dataset was considerably 
better than that of the other proposed models. 

In this paper, we create a discrete image vocabulary similar to that used 
in Duygulu et al [7] and Jeon et al. [9]. The main difference is that the initial 
regions we use are rectangular and generated by partitioning the image into a 
grid rather than using a segmentation algorithm. We find that this improves 
performance (see also [6]). Features are computed over these rectangular regions 
and then the regions are clustered across images. We call these clusters visterms 
(visual terms) to acknowledge that they are similar to terms in language. 

Berger et al. [3] proposed the use of Maximum Entropy approaches for various 
Natural Language Processing tasks in the mid-1990s, and since then many 
researchers have applied them successfully to a number of other tasks. The 
Maximum Entropy approach has not been much used in computer vision or imaging 
applications. In particular, we believe this is the first application of the 
Maximum Entropy approach to image annotation. 



3 Visual Representation 



An important question is how one can create visterms. In other words, how does 
one represent every image in the collection using a subset of items from a finite 
set of items? An intuitive answer to this question is to segment the image into 
regions, cluster similar regions and then use the clusters as a vocabulary. The 
hope is that this will produce semantic regions and hence a good vocabulary. In 
general, image segmentation is a very fragile and error-prone process and so the 
results are usually not very good. 

Barnard and Forsyth [2] and Duygulu et al. [7] used general-purpose segmentation 
algorithms like Normalized-cuts [16] to extract regions. In this paper, we 
partition the image into a regular grid of rectangular regions rather than 
segmenting it. The annotation algorithm works better when the image is 
partitioned into a regular grid than when a segmentation algorithm is used to 
break up the image into regions (see also [6]). This suggests that current 
state-of-the-art segmentation algorithms are still not good enough to extract 
regions corresponding to semantic entities. The Maximum Entropy algorithm cannot 
undo the hard decisions made by the segmentation algorithm, and these 
segmentation algorithms make decisions based on a single image. By using a finer 
grid, the Maximum Entropy algorithm automatically makes the appropriate 
associations. 

For each segmented region, we compute a feature vector that contains visual 
information about the region such as color, texture, position and shape. We used 
K-means to quantize these feature vectors and generate visterms. Each visterm 
represents a cluster of feature vectors. As in Duygulu et al. [7] we arbitrarily 
choose the value of k. In the future, we need a systematic way of choosing the 
value. 

After the quantization, each image in the training set can be represented 
as a set of visterms. A new test image can be segmented into regions and 
region features can be computed. Each region is then assigned the visterm 
whose cluster center is closest to it in feature space. 
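
The steps above (grid partition, feature extraction, K-means quantization and nearest-visterm assignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the only feature used is the mean color of a cell (the text also mentions texture, position and shape), the K-means routine is a plain hand-written version, the images are random toy data, and all function names are hypothetical.

```python
import random

def grid_features(image, rows=5, cols=5):
    """image: list of pixel rows, each pixel an (r, g, b) tuple.
    Returns one mean-color feature vector per grid cell, in row-major order."""
    h, w = len(image), len(image[0])
    feats = []
    for r in range(rows):
        for c in range(cols):
            pixels = [image[i][j]
                      for i in range(r * h // rows, (r + 1) * h // rows)
                      for j in range(c * w // cols, (c + 1) * w // cols)]
            feats.append(tuple(sum(ch) / len(pixels) for ch in zip(*pixels)))
    return feats

def nearest(feature, centers):
    """Index of the closest cluster center (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(feature, ctr)) for ctr in centers]
    return dists.index(min(dists))

def kmeans(features, k, iters=10, seed=0):
    """Plain K-means; k is chosen by hand, as in the text."""
    centers = random.Random(seed).sample(features, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for f in features:
            groups[nearest(f, centers)].append(f)
        # Keep the old center when a cluster happens to be empty.
        centers = [tuple(sum(ch) / len(g) for ch in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers

def random_image(rng, h=20, w=30):
    """Toy stand-in for a real image (assumption of this sketch)."""
    return [[(rng.random(), rng.random(), rng.random()) for _ in range(w)]
            for _ in range(h)]

rng = random.Random(1)
train_feats = [f for _ in range(6) for f in grid_features(random_image(rng))]
centers = kmeans(train_feats, k=4)                       # 4 visterms for the toy data
test_visterms = [nearest(f, centers) for f in grid_features(random_image(rng))]
```

Each test image ends up as a list of 25 visterm indices, one per grid cell, which is the representation the model in the next section works with.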

4 Maximum Entropy for Image Annotation 

We assume that there is a random process that, given an image as an observation, 
generates a label $y$, an element of a finite set $Y$. Our goal is to create a 
stochastic model for this random process. We construct a training dataset by 
observing the behavior of the random process. The training dataset consists of 
pairs $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ where $x_i$ represents an image 
and $y_i$ is a label. If an image has multiple labels, $x_i$ may be part of 
multiple pairings with other labels in the training dataset. Each image $x_i$ is 
represented by a vector of visterms. Since we are using rectangular grids, for 
each position of the cell there is a corresponding visterm. 

4.1 Predicate Functions and Constraints 

We can extract statistics from the training samples and these observations should 
match the output of our stochastic model. In Maximum Entropy, any statistic is 
represented by the expected value of a feature function. To avoid confusion with 
image features, from now on, we will refer to the feature functions as predicates. 
Two different types of predicates are used. 

Unigram Predicate 

This type of predicate captures the co-occurrence statistics of a visual term 
and a label. The following is an example unigram predicate that checks the 
co-occurrence of the label ‘tiger’ and the visterm $v_1$ in image $x$. 

$$ f_{v_1,\mathrm{tiger}}(x, y) = \begin{cases} 1 & \text{if } y = \text{tiger and } v_1 \in x \\ 0 & \text{otherwise} \end{cases} \qquad (1) $$

If image $x$ contains visual term $v_1$ and has ‘tiger’ as a label, then the value 
of the predicate is 1, otherwise 0. We have unigram predicates for every 
label and visterm pair that occurs in the training data. Since we have 125 
visual terms and 374 labels, the total number of possible unigram predicates 
is 46750. 



Bigram Predicate 

The bigram predicate captures the co-occurrence statistic of two visterms 
and a label. This predicate attempts to capture the configuration of the 
image, so the positional relationship between the two visterms is important. 
Two neighboring visterms are horizontally connected if they are next to each 
other and their row coordinates are the same. They are vertically connected 
if they are next to each other and their column coordinates are the same. 
The following example of a bigram predicate models the co-occurrence of the 
label ‘tiger’ and the two horizontally connected visterms $v_1$ and $v_2$ in 
image $x$. 

$$ f_{\mathrm{horizontal}\ v_1 v_2,\,\mathrm{tiger}}(x, y) = \begin{cases} 1 & \text{if } y = \text{tiger and } x \text{ contains horizontally connected } v_1, v_2 \\ 0 & \text{otherwise} \end{cases} \qquad (2) $$

If $x$ contains horizontally connected visterms $v_1$ and $v_2$ and ‘tiger’ is a 
label of $x$, then the value of the predicate is 1. We also use predicates that 
capture the occurrence of two vertically connected visterms. In the same way, we 
can design predicates that use 3 or more visterms or more complicated positional 
relationships. However, moving to trigrams or even n-grams leads to a large 
increase in the number of predicates and the number of parameters that must 
be computed, and this requires substantially more computational resources. 
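
The two predicate types can be sketched in a few lines of Python. This is an illustration only: an image is assumed to be a grid (list of rows) of visterm indices, the function names are hypothetical, and the bigram predicate treats the pair as ordered (the first visterm immediately left of, or above, the second), which is an assumption the text does not settle.

```python
def unigram_predicate(visterm, label):
    """f(x, y) = 1 if y == label and visterm occurs anywhere in the grid x."""
    def f(x, y):
        return int(y == label and any(visterm in row for row in x))
    return f

def bigram_predicate(v1, v2, label, direction="horizontal"):
    """f(x, y) = 1 if y == label and v1, v2 occupy adjacent cells of x.

    Assumption of this sketch: the pair is ordered, i.e. v2 sits directly to
    the right of v1 (horizontal) or directly below it (vertical)."""
    def f(x, y):
        if y != label:
            return 0
        rows, cols = len(x), len(x[0])
        for r in range(rows):
            for c in range(cols):
                if x[r][c] != v1:
                    continue
                if direction == "horizontal" and c + 1 < cols and x[r][c + 1] == v2:
                    return 1
                if direction == "vertical" and r + 1 < rows and x[r + 1][c] == v2:
                    return 1
        return 0
    return f

# A toy 3x3 grid of visterm indices.
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 1, 2]]

tiger_v1 = unigram_predicate(1, "tiger")
tiger_v1_v2 = bigram_predicate(1, 2, "tiger")

# Number of possible unigram predicates for 125 visterms and 374 labels.
n_unigram = 125 * 374
```

With 125 visterms the number of possible ordered bigram predicates per direction and label is already 125 * 125, which illustrates why moving to longer n-grams is costly.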

The expected value of a predicate with respect to the training data is defined 
as follows: 

$$ \tilde{p}(f) = \sum_{x,y} \tilde{p}(x, y) f(x, y) \qquad (3) $$

where $\tilde{p}(x, y)$ is an empirical probability distribution that can be easily 
calculated from the training data. The expected value of the predicate with 
respect to the output of the stochastic model should be equal to the expected 
value measured from the training data: 

$$ \sum_{x,y} \tilde{p}(x, y) f(x, y) = \sum_{x,y} \tilde{p}(x) p(y|x) f(x, y) \qquad (4) $$

where $p(y|x)$ is the stochastic model that we want to construct. We call equation 
4 a constraint. We have to choose a model that satisfies this constraint for all 
predicates. 
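
Both sides of the constraint in equation 4 can be computed directly from the training pairs. The sketch below is illustrative: images are simplified to sets of visterm indices, the uniform `model` is a placeholder rather than a trained model, and all names are hypothetical. Because an image with several labels appears in several pairs, averaging over pairs weights each image by its empirical probability.

```python
def empirical_expectation(f, pairs):
    """Left-hand side of eq. 4: average of f(x, y) over the training pairs."""
    return sum(f(x, y) for x, y in pairs) / len(pairs)

def model_expectation(f, pairs, labels, model):
    """Right-hand side of eq. 4: sum over x of p~(x) * sum over y of p(y|x) f(x, y).

    model(x) must return a dict mapping each label to p(y|x)."""
    total = 0.0
    for x, _ in pairs:
        probs = model(x)
        total += sum(probs[y] * f(x, y) for y in labels)
    return total / len(pairs)

labels = ["tiger", "grass", "sky"]
pairs = [({1, 2}, "tiger"), ({2, 3}, "grass"), ({1, 3}, "tiger"), ({3}, "sky")]

def f(x, y):
    """Unigram predicate for visterm 1 and label 'tiger'."""
    return int(y == "tiger" and 1 in x)

def uniform_model(x):
    """Placeholder model that ignores the image (assumption of this sketch)."""
    return {y: 1.0 / len(labels) for y in labels}

observed = empirical_expectation(f, pairs)
predicted = model_expectation(f, pairs, labels, uniform_model)
```

For the uniform placeholder the two values differ, so this model violates the constraint; training adjusts the model until the two expectations agree for every predicate.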

4.2 Parameter Estimation and Image Annotation 

In theory, there are an infinite number of models that satisfy the constraints 
explained in section 4.1. In Maximum Entropy, we choose the model that has 
maximum conditional entropy 

$$ H(p) = -\sum_{x,y} \tilde{p}(x) p(y|x) \log p(y|x) \qquad (5) $$

The constrained optimization problem is to find the $p$ which maximizes $H(p)$ 
given the constraints in equation 4. Following Berger et al. [3] we can do this 
using Lagrange multipliers. For each predicate $f_i$, we introduce a Lagrange 
multiplier $\lambda_i$. We then need to maximize 

$$ \Lambda(p, \lambda) = H(p) + \sum_i \lambda_i \left( p(f_i) - \tilde{p}(f_i) \right) \qquad (6) $$

Given fixed $\lambda$, the solution may be obtained by maximizing over $p$. This 
leads to the following equations [3]: 

$$ p(y|x) = \frac{1}{Z(x)} \exp\left[ \sum_{i=1}^{k} \lambda_i f_i(x, y) \right] \qquad (7) $$

$$ \Psi(\lambda) = -\sum_{x} \tilde{p}(x) \log Z(x) + \sum_i \lambda_i \tilde{p}(f_i) \qquad (8) $$

where $Z(x)$ is a normalization factor which is obtained by setting $\sum_y p(y|x) = 1$. 
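
Equation 7 is a normalized exponential of the weighted predicate sum. A minimal sketch follows, assuming the weights have already been estimated; the single predicate and its weight below are made up for illustration and the function name is hypothetical.

```python
import math

def conditional_probs(x, labels, predicates, weights):
    """p(y|x) from eq. 7: exp(sum_i lambda_i f_i(x, y)) / Z(x) for every label y."""
    scores = {y: sum(w * f(x, y) for f, w in zip(predicates, weights))
              for y in labels}
    top = max(scores.values())               # subtract the maximum for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())                   # plays the role of Z(x), up to the factor exp(top)
    return {y: e / z for y, e in exps.items()}

labels = ["tiger", "grass"]
predicates = [lambda x, y: int(y == "tiger" and 1 in x)]
weights = [math.log(3.0)]                    # illustrative value, not an estimated one

p_with_v1 = conditional_probs({1}, labels, predicates, weights)
p_without_v1 = conditional_probs({2}, labels, predicates, weights)
```

When no predicate fires for an image, all labels receive the same score and the distribution is uniform, which is the maximum entropy choice in the absence of evidence.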

The solution to this problem is obtained by iteratively solving both these 
equations. A few algorithms have been proposed in the literature, including 
Generalized Iterative Scaling and Improved Iterative Scaling [13]. We use the 
Limited Memory Variable Metric method, which is very effective for Maximum 
Entropy parameter estimation [13]. We use Zhang Le’s [18] maximum entropy 
toolkit for the experiments in this paper. 



Table 1. Experimental results 

Experiment                        recall    precision    F-measure 
Translation                       0.04      0.06         0.05 
CMRM                              0.09      0.10         0.10 
Binary Unigram                    0.11      0.08         0.10 
Binary Unigram + Binary Bigram    0.12      0.09         0.11 

Fig. 1. Examples: Images in the first row are the top 4 images retrieved by query 
‘swimmer’. Images in the second row are the top 4 images retrieved by query ‘ocean’. 



5 Experiment 

5.1 Dataset 

We use the dataset in Duygulu et al. [7] to compare the models. We partition 
images into 5x5 rectangular grids and for each region, extract a feature vector. 
The feature vector consists of average LAB color and the output of the Gabor 
filters. By clustering the feature vectors across the training images, we get 125 
visterms. 

The dataset has 5,000 images from 50 Corel Stock Photo CDs. Each CD includes 
100 images on the same topic. 4,500 images are used for training and 500 are 
used for testing. Each image is assigned from 1 to 5 keywords. Overall there are 
374 words (see [7]). 

5.2 Results 

We automatically annotate each test image using the top 5 words and then 
simulate image retrieval tasks using all possible one-word queries. We calculate 
the mean precision and the mean recall over the queries, and also the F-measure 
by combining the two measures using $F = 1/(\lambda \frac{1}{P} + (1 - \lambda) \frac{1}{R})$, 
where $P$ is the mean precision and $R$ is the mean recall. We set $\lambda$ to 0.5. 
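
The F-measure formula above is a weighted harmonic mean of precision and recall. A small sketch, with a hypothetical function name and a `weight` argument standing for the weighting parameter that the text sets to 0.5:

```python
def f_measure(precision, recall, weight=0.5):
    """F = 1 / (weight / P + (1 - weight) / R); returns 0 if either input is 0."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (weight / precision + (1.0 - weight) / recall)

# Rounded precision and recall of the Translation row of Table 1.
f_translation = f_measure(0.06, 0.04)
```

Applying the formula to the rounded table entries need not reproduce the F-measure column digit for digit, since the published values were presumably computed from unrounded precision and recall.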

In this paper, we used the results of the Translation Model [7] and the 
CMRM [9] as our baselines since they also use similar features. The experiment 
shows that Maximum Entropy using unigram predicates has performance 
comparable to the CMRM model (both have F-measures of 0.1). The unigram 
Maximum Entropy model has better recall, while CMRM has better precision. Both 
models outperform the classical translation model used by [7]. Using unigram and 
bigram predicates together improves the performance of the Maximum Entropy 
model. Our belief is that by using predicates which provide even more 
configuration information, the model’s performance can be further improved. 

Models which use continuous features (for example [12]) perform even better. 
Maximum Entropy models can also be used to model such continuous features 
and future work will involve using such features. The results show that Maximum 
Entropy models have great potential and also enable us to incorporate arbitrary 
configuration information in a natural way. 

6 Conclusions and Future Work 

In this paper we show that Maximum Entropy can be used for the image 
annotation task, and the experimental results show the potential of the approach. 
Since we can easily add new types of predicates (this is one of the nice 
properties of Maximum Entropy), there is great potential for further improvements 
in performance. More work on continuous-valued predicates, image segmentation 
techniques and feature extraction methods will also lead to performance 
improvements. 

Abstract. The paper first addresses the main issues in current content-based 
image retrieval to conclude that the largest factors of innovation are found in the 
large size of the datasets, the ability to segment an image softly, the interactive 
specification of the user’s wish, the sharpness and invariant capabilities of 
features, and the machine learning of concepts. Among these, everything gets better 
every year apart from the need for annotation, which gets worse with every 
increase in the dataset size. Therefore, we direct our attention to the question: what 
fraction of images needs to be labeled to get an almost similar result compared to 
the case when all images would have been labeled by annotation? And how can 
we design an interactive annotation scheme where we put up for annotation those 
images which are most informative in the definition of the concept (boundaries)? 
We have developed an annotation scheme, random selection followed by sequential 
selection, which requires annotating 1%, equal to 25 items in a dataset of 2500 faces 
and non-faces, to yield an almost identical boundary of the face-concept compared 
to the situation where all images would have been labeled. This approach for this 
dataset has reduced the effort of annotation by 99%. 



1 Introduction 

With the progress in content-based image retrieval, we have left behind the early years 
[12]. Some patterns in the development and use of image and video retrieval systems 
are visible. 

We give a short historical discussion on critical trends in the state of the art of 
content-based image retrieval and related topics: 

Papers on computer vision methods no longer deal with a dataset size of 50, as 
was common in the nineties; typically, datasets of thousands of images are being 
used today. The data are no longer standardized during recording, nor are they 
precisely documented. Instead, large datasets have brought general datasets, which 
are weakly documented and roughly standardized. As an immediate consequence of 
the amount, analysis software has to be robust. And the sheer amount of data 
quickly favors methods and algorithms that are robust against many different 
sources of variation. 

Another essential breakthrough came with the realization that precise 
segmentation is unreachable and pointless for many tasks. Weak segmentation [12] 
strives towards finding a part of the object. To identify a scene, recognition of 
some details may suffice as long as the details are unique in the database. 
Similarly, spatial relations between parts may suffice in recognition as long as 
the spatial pattern is unique. 

Interaction is another essential step forward in useful content-based retrieval. Many 
users have not yet completely made up their minds about what they want from the 
retrieval system. Even if they have, they surely want to grasp the result in a 
context also presenting images of similar relevance. The common scenario of 
interaction in full breadth is described in [19]. A different point of view is to 
perceive relevance feedback as a method of instant machine learning. In this paper 
we lay the foundation for the latter issue. 

Computer vision of almost any kind starts with features. Good features are 
capable of describing the semantics of relevant issues amidst a large variety of 
expected scenes and ignoring the irrelevant circumstances of the recording. Where 
the 80s and 90s concentrated on good features specific for each application, we 
now see a trend towards general and complete sets of features invariant to the 
accidental conditions of the recording while maintaining enough robustness and 
discriminatory power to distinguish among different semantics. Invariance classes 
can be of salient-photometric nature [11], color-photometric nature [3,4], given by 
occlusion and clutter requiring local features, or of geometric nature [16,8]. In a 
recent comparison [11] of the effectiveness of the various feature sets, the SIFT 
set came out as the best one [7]. Robust, effective, narrow-invariant features are 
important as they focus maximally on the object-proper characteristics. 

The application of machine learning techniques takes away the incidental variations 
of the concept and searches for the common at the level of the object class. Many dif- 
ferent achievements have been made in object detection using tools of machine learning 
including the Support Vector Machines [9], boosting [18] and Gaussian mixture models 
[13]. 

All these items show progression every year. Segmentation has evolved into statis- 
tical methods of grouping local evidence. Interaction has absorbed relevance feedback, 
non-linear mapping on the screen and adaptable similarity measures. Features get more 
precise, more specifically invariant and powerful as well as more robust. Machine learn- 
ing requires less data and performs better even if the boundary is very complex. 

The foregoing shows that everything gets better all the time, apart from the amount 
of data. More data demand more effort in annotation up to the point where the data set 
gets so big that annotation is no longer feasible. Annotating thousands and eventually 
hundreds of thousands of pictures is hard to achieve, let alone the usual questions about 
the usefulness for one purpose and the reliability of such annotations. Where the machine 
power to make increasingly many computations is in place, the manpower for annotation 
will become the bottleneck. 

Therefore, in this paper we concentrate on the following questions: 

1. What fraction of images needs to be annotated with a label (while the rest is used 
unlabeled) to arrive at an almost similar definition (of the boundary) of the concept 
in feature space compared to the case when all images would have been labeled? 

2. How to design a sequential selection rule yielding the most informative set of images 
offered to the user for annotation? 

3. Can concepts be learned interactively in CBIR? 

In section 2 the optimal data selection mechanism is discussed in the framework of 
active learning. The section points out the advantage of the discriminative probabilistic 
models for active learning. In particular, we present two well-founded methods for data 
selection. The proposed concepts are illustrated in section 3 which shows the results of 
each of the methods for annotation of human face images in a database. 



2 Data Annotation Using Active Learning 

Consider the annotation of the images in a database $x_1, \ldots, x_N \in \mathbb{R}^d$ 
that is restricted to two classes with the labels $+1$, $-1$ meaning “interesting” 
and “not interesting”, or “object of a certain type” and “not such object”, or 
whatever other dichotomy. In effect we study the case of a one-class classifier 
where we want to learn one specific concept. If there are more concepts to learn, 
they have to be learned in sequence. 

As labeling the entire database is considered impossible or at least inefficient, our 
aim is to label only a fraction of the images in the database, denoted by 
$\mathcal{D}_l$. The remainder $\mathcal{D}_u$ will stay unlabeled. We will rely on an 
existing classifier indicated by $y = \mathrm{sign} f(x)$ which will perform the 
assignment of all images in the database to the class or to the non-class. 

The classifier needs to be trained on $\mathcal{D}_l$, and its performance varies for 
different labeled sets. This poses the question of how to collect the optimal 
$\mathcal{D}_l$ in terms of the minimum number of images to label in order to arrive 
at a faithful best estimate of the concept boundary and the classification error. 
The problem is known as active learning [6]. It has been shown that a selective 
approach can achieve a much better classifier than random selection. 

Typically, data are collected one by one. Intuitively, samples in $\mathcal{D}_u$ 
closest to the current classification boundary should be selected as they promise a 
large move of the boundary. We name this approach the closest-to-boundary 
criterion. This heuristic approach has been used in active learning systems 
[6, 14]. Other more theoretical criteria, however, proved to be inefficient [15,1]. 
Explaining this phenomenon, we notice that most existing methods use Support 
Vector Machines (SVM) [17] as the base learning algorithm. This classifier exhibits 
good performance in many pattern recognition applications, and works better than 
the generative methods for small and not randomly selected training sets. However, 
being an optimization with inequality constraints, SVM is inefficient for the 
common task in active learning where one needs to predict the classifier parameters 
upon the arrival of new training data. We argue that discriminative models with an 
explicit probabilistic formulation are better models for active learning. The most 
popular is regularized logistic regression [21]. Other alternatives also exist [20]. 
Training such models boils down to the estimation of the parameters of the class 
posterior probability $p(y|x)$, which is assumed to have a known form. While the 
discriminative probabilistic models have similar performance to SVM, they need only 
unconstrained optimization and, furthermore, are suitable for probabilistic 
analysis. The latter advantages make the models more flexible in meeting 
sophisticated criteria in data selection. 
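
The closest-to-boundary criterion mentioned above amounts to picking the unlabeled sample with the smallest absolute decision value. A minimal sketch, assuming a linear decision function with made-up parameters; the function names are hypothetical.

```python
def closest_to_boundary(unlabeled, f):
    """Index of the unlabeled sample with the smallest |f(x)|, where the
    classifier is y = sign f(x)."""
    return min(range(len(unlabeled)), key=lambda i: abs(f(unlabeled[i])))

def linear_f(x, a=(1.0, -1.0), b=0.5):
    """Illustrative linear decision function a . x + b (parameters are made up)."""
    return sum(ai * xi for ai, xi in zip(a, x)) + b

unlabeled = [(2.0, 0.0), (0.0, 0.4), (-3.0, 1.0)]
chosen = closest_to_boundary(unlabeled, linear_f)
```

Here the decision values are 2.5, 0.1 and -3.5, so the second sample is offered for labeling.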

The section presents two new active learning methods for the discriminative 
probabilistic models. They are better founded and have better performance than the 
closest-to-boundary method. 

2.1 Data Selection by the Maximal Increase in Likelihood 



In the first example, we consider the selection method that increases the likelihood of 
the labeled set as much as possible. 

Suppose $\theta$ is the parameter vector of the distribution $p(y|x)$. In the case of 
logistic regression: 

$$ p(y|x) = \frac{1}{1 + \exp\{-y(a \cdot x + b)\}} \qquad (1) $$

and $\theta = \{a, b\}$, $a \in \mathbb{R}^d$, $b \in \mathbb{R}$. The parameters should be 
estimated via the minimization of the regularized negative log-likelihood 
(equivalently, the maximization of the regularized likelihood): 

$$ \mathcal{L} = -\ln p(\theta) - \sum_{i \in I_l} \ln p(x_i, y_i) - \sum_{i \in I_u} \ln p(x_i) \qquad (2) $$

where $I_l$ and $I_u$ denote the sets of indices of labeled and unlabeled samples 
respectively. The prior probability $p(\theta)$, usually defined as 
$\exp\{-\frac{\lambda}{2}\|a\|^2\}$, is added as a regularization term to overcome 
numerical instability. The likelihood (2) can also be written as the sum of the 
likelihood of the class label: 

$$ \mathcal{L}^{(1)} = -\ln p(\theta) - \sum_{i \in I_l} \ln p(y_i|x_i; \theta) \qquad (3) $$

and the likelihood of the data alone: 

$$ \mathcal{L}^{(2)} = -\sum_{i \in I_l \cup I_u} \ln p(x_i) \qquad (4) $$

Only $\mathcal{L}^{(1)}$ needs to be minimized since $\mathcal{L}^{(2)}$ does not depend 
on $\theta$. Let $\hat{\theta}$ be the solution of the minimization of 
$\mathcal{L}^{(1)}$ and $\hat{\mathcal{L}}$ be the value of the minimum. Observe that 
$\hat{\mathcal{L}}$ can only increase when new data are labeled. This motivates an 
approach that selects the sample maximizing the increase of $\hat{\mathcal{L}}$, 
which promotes a quicker convergence of the learning. Moreover, such a criterion 
will result in a training set with high likelihood, which would produce more 
reliable parameter estimates. 
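
The class-label part of the objective for logistic regression can be written down directly. The sketch below is illustrative: it assumes a quadratic penalty on `a` with a hypothetical weight `lam` (the exact constant of the prior is an assumption here), and the function name and data are made up.

```python
import math

def neg_log_likelihood(a, b, labeled, lam=1.0):
    """Regularizer plus the sum of -ln p(y_i | x_i; theta) over labeled samples.

    labeled: list of (x, y) with x a tuple of floats and y in {+1, -1}.
    lam: hypothetical regularization weight (an assumption of this sketch)."""
    penalty = 0.5 * lam * sum(ai * ai for ai in a)        # -ln p(theta), up to a constant
    loss = 0.0
    for x, y in labeled:
        margin = y * (sum(ai * xi for ai, xi in zip(a, x)) + b)
        loss += math.log1p(math.exp(-margin))             # -ln p(y|x) for the logistic model
    return penalty + loss

labeled = [((1.0, 2.0), +1), ((-1.0, 0.5), -1), ((0.0, -1.0), -1)]
at_zero = neg_log_likelihood((0.0, 0.0), 0.0, labeled)
with_extra = neg_log_likelihood((0.0, 0.0), 0.0, labeled + [((2.0, 2.0), +1)])
```

Every additional labeled sample contributes a non-negative term for any parameter value, which is why the minimum of the objective cannot decrease when new data are labeled.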

The increase of $\hat{\mathcal{L}}$ when a prospective image $x_s$ is added to the 
labeled set can be approximated analytically by the first step in Newton’s 
minimization method without redoing the complete minimization: 

$$ \Delta\mathcal{L}_s \approx -\ln p(y_s|x_s; \hat{\theta}) - \nabla \ln p(y_s|x_s)^T \left[ \nabla^2 \mathcal{L} - \nabla^2 \ln p(y_s|x_s) \right]^{-1} \nabla \ln p(y_s|x_s) \qquad (5) $$


The selection criterion can be based on the expectation of $\Delta\mathcal{L}_s$ with 
respect to $y_s$: 

$$ \hat{s} = \arg\max_s E_{y_s}[\Delta\mathcal{L}_s | x_s] \qquad (6) $$

$$ E_{y_s}[\Delta\mathcal{L}_s | x_s] = p(y_s = 1 | x_s)\,\Delta\mathcal{L}_s^{+} + p(y_s = -1 | x_s)\,\Delta\mathcal{L}_s^{-} \qquad (7) $$

where $\Delta\mathcal{L}_s^{+}$ and $\Delta\mathcal{L}_s^{-}$ denote the values of 
$\Delta\mathcal{L}_s$ for $y_s = 1$ and $y_s = -1$ respectively. The probability 
$p(y_s|x_s)$ can be approximated using the current estimate $\hat{\theta}$. Note that 
one could also use $\min\{\Delta\mathcal{L}_s^{+}, \Delta\mathcal{L}_s^{-}\}$ in place of 
the expectation without making an assumption on the accuracy of $\hat{\theta}$. In our 
current experiments, this criterion performs the same as eq. (6). Eq. (6) is 
therefore preferred as it is more attractive analytically. 
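
Once the two likelihood increases of a candidate are available, the selection rule of eqs. (6) and (7) is a probability-weighted average followed by an argmax. The sketch below assumes the increases for both possible labels have already been computed (for instance by the Newton-step approximation above); the candidate values are made up and the function names are hypothetical.

```python
def expected_increase(p_pos, delta_pos, delta_neg):
    """Eq. (7): expectation of the likelihood increase over the unknown label."""
    return p_pos * delta_pos + (1.0 - p_pos) * delta_neg

def select_by_expectation(candidates):
    """Eq. (6): index of the candidate with the largest expected increase.

    candidates: list of (p(y=+1|x), increase if y=+1, increase if y=-1)."""
    scores = [expected_increase(p, dp, dn) for p, dp, dn in candidates]
    return scores.index(max(scores))

def select_by_minimum(candidates):
    """Variant from the text: use the smaller of the two increases instead."""
    scores = [min(dp, dn) for _, dp, dn in candidates]
    return scores.index(max(scores))

# Made-up values for three unlabeled candidates.
candidates = [(0.9, 0.1, 2.3), (0.5, 0.7, 0.7), (0.2, 1.6, 0.2)]
by_expectation = select_by_expectation(candidates)
by_minimum = select_by_minimum(candidates)
```

In this toy example both rules pick the second candidate, the one for which either label would raise the objective noticeably.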
2.2 Active Learning Using Pre-clustering and Prior Knowledge 

In this section, we show another active learning method where the discriminative 
probabilistic models are used to take advantage of pre-clustering and prior data 
knowledge. In [2], Cohn et al. suggest a general approach to select the samples 
that minimize the expected future classification error: 

$$ \int_x E\left[ (\hat{y}(x) - y(x))^2 \,|\, x \right] p(x)\,dx \qquad (8) $$

where $y(x)$ is the true label of $x$ and $\hat{y}(x)$ is the classifier output. 
Although the approach is well founded statistically, the direct implementation is 
usually difficult since the data distribution $p(x)$ is unknown or too complex for 
the calculation of the integral. A more practical criterion would be to select the 
sample that has the largest contribution to the current error, that is, the 
expression under the integral: 

$$ s = \arg\max_s E\left[ (\hat{y}_s - y_s)^2 \,|\, x_s \right] p(x_s) \qquad (9) $$

If no prior knowledge is available, one can only rely on the error expectation 
$E[(\hat{y}_s - y_s)^2 | x_s]$. Replacing the probability $p(y_s|x_s)$ by its current 
estimate, one can show that the error expectation is maximal for samples lying on 
the current decision boundary. More interesting, however, is the case where $p(x)$ 
is known and non-uniform. The information about the distribution can then be used 
to select better data. One possibility to obtain $p(x)$ is clustering, which can be 
done offline without human interaction. The clustering information is useful for 
active learning for the following two reasons: 

1. The clustering gives extra information for assessing the importance of an unlabeled 
sample. In particular, the most important are the representative samples located in 
the center of the clusters, where the density p(x) is high. 

2. Samples in the same cluster are likely to have the same label. This is known as the 
cluster assumption. Its immediate application is that the class label of one sample 
can be propagated to the other samples of the same cluster. If so, the active learning 
can be accelerated as it is sufficient to label just one sample per cluster. 
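
Criterion (9) weights the expected error of a sample by the data density at that sample. The sketch below is illustrative and rests on two assumptions: labels are +1 and -1, and the classifier outputs the more probable label, so the squared error is 4 when the prediction is wrong and 0 otherwise. The candidate values are made up and the function names are hypothetical.

```python
def expected_error(p_pos):
    """E[(y_hat - y)^2 | x] for labels in {+1, -1} when y_hat is the more
    probable label: 4 times the probability of a wrong prediction."""
    return 4.0 * min(p_pos, 1.0 - p_pos)

def select_sample(candidates):
    """Criterion (9): maximize expected error times density.

    candidates: list of (p(y=+1|x), density p(x)) pairs."""
    scores = [expected_error(p) * density for p, density in candidates]
    return scores.index(max(scores)), scores

# Made-up candidates: on the boundary but isolated, near the boundary in a
# dense region, and far from the boundary in a very dense region.
candidates = [(0.5, 0.1), (0.7, 0.5), (0.95, 0.9)]
chosen, scores = select_sample(candidates)
```

Without the density term the first candidate, which lies exactly on the decision boundary, would be selected; with it, the second candidate wins because it represents many more samples.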

The remainder of the section shows a framework for incorporating pre-clustering 
information into active learning. In the standard classification, data generation 
is described by the joint distribution $p(x, y)$. The clustering information can be 
explicitly incorporated by introducing the hidden cluster indicator 
$k \in \{1, \ldots, K\}$, where $K$ is the number of clusters in the data, and $k$ 
indicates that the sample belongs to the $k$-th cluster. Assume that all information 
about the class label $y$ is already encoded in the cluster indicator $k$. This 
implies that once $k$ is known, $y$ and $x$ are independent. The joint distribution 
is written as: 

$$ p(x, y, k) = p(y|x, k)\,p(x|k)\,p(k) = p(y|k)\,p(x|k)\,p(k) \qquad (10) $$

$p(y|k)$ is the class probability of a whole cluster. Discriminative probabilistic 
models can be used to model this probability. In the reported experiment, we have 
used the logistic regression: 

$$ p(y|k) = \frac{1}{1 + \exp\{-y(c_k \cdot a + b)\}} \qquad (11) $$

where $c_k$ is the representative of the $k$-th cluster, which is determined by 
K-medoid clustering [5]. As such, $p(y|k)$ specifies a classifier on a 
representative subset of the data. In the ideal case where the data is well 
clustered, once all the parameters of $p(y|k)$ are determined, one could use this 
probability to determine the label of the cluster representatives, and then assign 
the same label to the remaining samples in the cluster. In practice, however, such 
classification will be inaccurate for samples disputed by several clusters. To 
better classify those samples, a soft cluster membership should be used. This is 
achieved with the noise distribution $p(x|k)$ which propagates the information of 
label $y$ from the representatives into the remaining majority of the data. In the 
experiment, the $p(x|k)$ are isotropic Gaussians with the same variance for all 
clusters. As such, $p(x)$ is a mixture of $K$ Gaussians with the weights $p(k)$. 

Given the above model, the class posterior is calculated as follows: 

$$ p(y|x) = \sum_{k=1}^{K} p(y, k|x) = \sum_{k=1}^{K} p(y|k)\,p(k|x) \qquad (12) $$

where $p(k|x) = p(x|k)\,p(k) / p(x)$. As observed, the classification decision is the 
weighted combination of the classification results for the representatives with the 
weights $p(k|x)$. Well clustered samples will be assigned the label of the nearest 
representative. Samples in between the clusters, on the other hand, will be assigned 
the label of the cluster which has the highest confidence. 
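
The class posterior of eq. (12) can be sketched as follows, assuming isotropic Gaussian clusters with a shared standard deviation `sigma` and a logistic model on the cluster representatives as in eq. (11). The cluster centers, weights and classifier parameters are made up, and the function names are hypothetical.

```python
import math

def cluster_posteriors(x, centers, weights, sigma):
    """p(k|x) for isotropic Gaussians with a shared variance sigma**2.
    The common Gaussian normalization constant cancels in the ratio."""
    logs = [math.log(w) - sum((xi - ci) ** 2 for xi, ci in zip(x, ctr)) / (2 * sigma ** 2)
            for ctr, w in zip(centers, weights)]
    top = max(logs)                                   # stabilize the exponentials
    exps = [math.exp(v - top) for v in logs]
    total = sum(exps)
    return [e / total for e in exps]

def class_posterior(x, centers, weights, sigma, a, b):
    """p(y=+1|x) = sum_k p(y=+1|k) p(k|x), with p(y|k) logistic in c_k (eq. 11)."""
    pk = cluster_posteriors(x, centers, weights, sigma)
    p_pos = [1.0 / (1.0 + math.exp(-(sum(ai * ci for ai, ci in zip(a, ctr)) + b)))
             for ctr in centers]
    return sum(p * q for p, q in zip(p_pos, pk))

# Two made-up clusters, one on each side of the decision boundary.
centers = [(0.0, 0.0), (4.0, 0.0)]
weights = [0.5, 0.5]
a, b, sigma = (1.0, 0.0), -2.0, 1.0

near_first = class_posterior((0.0, 0.0), centers, weights, sigma, a, b)
near_second = class_posterior((4.0, 0.0), centers, weights, sigma, a, b)
in_between = class_posterior((2.0, 0.0), centers, weights, sigma, a, b)
```

A sample at a cluster center essentially inherits the class probability of that cluster, while a sample halfway between the two clusters receives an even mixture of both.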

The parameter estimation for the proposed model can also be done by minimizing 
the same likelihood functions defined in section 2.1. A minor difference is that 
$\mathcal{L}^{(1)}$ now depends also on the parameters of the prior data 
distributions, denoted by $\theta'$: 

$$ \mathcal{L}^{(1)} = -\ln p(\theta) - \sum_{i \in I_l} \ln \sum_{k=1}^{K} p(y_i|k; \theta)\,p(k|x_i; \theta') \qquad (13) $$

However, the parameters $\theta'$ are determined mainly by optimizing 
$\mathcal{L}^{(2)}$ since the unlabeled data are overwhelming over the labeled data. 
The optimization of each likelihood term can therefore be done separately. The 
clustering algorithm optimizes the likelihood $\mathcal{L}^{(2)}$ to obtain the prior 
data distribution. The optimization of $\mathcal{L}^{(1)}$ follows to obtain 
$p(y|k)$. 



3 Experiment 

The proposed algorithms were tested to separate images of human faces from non-face 
images in a database. We have used the database created in [10], which contains 30000 
face images of 1000 people and an equal number of non-face images. The images were 
collected randomly on the Internet and were subsampled to patterns of size 20x20. The 
face images are carefully aligned to cover the central part of the face. Each original 
face image is slightly transformed using rotation, scaling and translation, resulting in 
30 images per person. Example views of some images are shown in Figure 1. From 
this full database, different test sets were extracted with different settings in number of 
images and proportions between the numbers of face and non-face images. Each image 
is regarded as a vector composed of the pixel grey values. 

Fig. 1. Example view of images in the test database. 

Fig. 2. The error curve for the classification of face images using the method proposed in section 
2.1. The method is also compared to two other methods that use the closest-to-boundary approach. 
$n_1$ and $n_2$ are the numbers of face and non-face images in the databases respectively. 

The initial training set contains 5 face and 5 non-face images, and 30-80 more images
are added to it during the experiment. Each time a new sample is added to the training
set, the classifier is re-trained and tested on the rest of the database. The classification
error is calculated as the number of wrong classifications, including both missed face
samples and false alarms. The performance is evaluated by the decrease of the error
as a function of the amount of labeled data. Furthermore, to ensure the accuracy of
the results, the error is averaged over a number of runs of the algorithms with
different randomly selected initial training sets.
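The error measure and the averaging over runs described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' code: the function names and the label convention (1 for face, 0 for non-face) are hypothetical.

```python
def classification_error(true_labels, predicted_labels):
    """Count wrong classifications: missed faces plus false alarms.

    Assumed convention: label 1 means face, label 0 means non-face.
    """
    pairs = list(zip(true_labels, predicted_labels))
    missed = sum(1 for t, p in pairs if t == 1 and p == 0)
    false_alarms = sum(1 for t, p in pairs if t == 0 and p == 1)
    return missed + false_alarms


def average_error_curves(curves):
    """Average several error curves point by point.

    Each curve holds the error after each newly labeled sample, for one
    randomly selected initial training set.
    """
    n_runs = len(curves)
    return [sum(values) / n_runs for values in zip(*curves)]
```

Each curve passed to `average_error_curves` would correspond to one run started from a different random initial training set.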

Figure 2 shows the performance curves for the method using the maximal increase 
in likelihood. Figure 3 shows the same curves for the method based on pre-clustering. 
In both figures, the new methods are compared to two other active learning algorithms 
which use the closest-to-boundary criterion. It is observed, for instance, in Figure 3a that 
the best classifier performance levels off to a 4% error. This is the result that would be 
obtained by labeling all images in the database. The figure shows that the same result




Fig. 3. The results for the classification of face images using pre-clustering as proposed in section 
2.2. The proposed method is compared with two SVM-based active learning algorithms that use 
the closest-to-boundary and random selection. 



is achieved with only a small fraction of the database offered for interactive annotation 
leaving the remainder of the data unlabeled. 

4 Conclusion 

In all results, the new methods show better performance than the existing methods of
active learning. That is to say, fewer images need to be offered to the user for labeling
to reach the same result. In effect, as little as 1% of the database, equal to 25 images in the case of
Figure 3, needs to be labeled (the first 10 selected at random and the remaining 15 selected
sequentially according to their informativeness with respect to the concept boundary) to train the classifier on the
concept.

We have reason to believe the improvement tends to be more significant for even 
larger databases. 
Abstract. Most image retrieval systems only allow a fragment of text
or an example image as a query. Most users have more complex infor- 
mation needs that are not easily expressed in either of these forms. This 
paper proposes a model based on the Inference Network framework from 
information retrieval that employs a powerful query language that allows 
structured query operators, term weighting, and the combination of text 
and images within a query. The model uses non-parametric methods to 
estimate probabilities within the inference network. Image annotation 
and retrieval results are reported and compared against other published 
systems and illustrative structured and weighted query results are given 
to show the power of the query language. The resulting system both 
performs well and is robust compared to existing approaches. 



1 Introduction 

Many existing image retrieval systems retrieve images based on a query image [1].
However, recently a number of methods have been developed that allow images 
to be retrieved given a text query [2,3,4]. Such methods require a collection of 
training images annotated with a set of words describing the image’s content. 
From a user’s standpoint, it is generally easier and more intuitive to produce a 
text query rather than a query image for a certain information need. Therefore, 
retrieval methods based on textual queries are desirable. 

Unfortunately, most retrieval methods that allow textual queries use a rudi- 
mentary query language where queries are posed in natural language and terms 
are implicitly combined with a soft Boolean AND. For example, in current mod- 
els, the query tiger jet allows no interpretation other than tiger AND jet. 
What if the user really wants tiger OR jet? Such a query is not possible with 
most existing models. A more richly structured query language would allow such 
queries to be evaluated. 

Finally, an image retrieval system should also allow seamless combination of 
text and image representations within queries. That is, a user should be able to 
pose a query that is purely textual, purely based on an image, or some combi- 
nation of text and image representations. 



P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 42-50, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004
This paper presents a robust image retrieval model based on the popular
Inference Network retrieval framework [5] from information retrieval that suc- 
cessfully combines all of these features. The model allows structured, weighted 
queries made up of both textual and image representations to be evaluated in a 
formal, efficient manner. 

We first give a brief overview of related work, including a discussion of 
other image retrieval methods and an overview of the Inference Network re- 
trieval framework in Section 2. Section 3 details our model. We then describe 
experimental results and show retrieval results from several example queries in 
Section 4. Finally, we discuss conclusions and future work in Section 5. 



2 Related Work 

The model proposed in this paper draws heavily from past work in the informa- 
tion and image retrieval fields. This section gives an overview of related work 
from both fields. A number of models use associations between terms and im- 
age regions for image annotation and retrieval. The Co-occurrence Model [6] 
looks at the co-occurrences of annotation words and rectangular image regions 
to perform image annotation. The Translation Model [7] uses a classic machine 
translation technique to translate from a vocabulary of terms to a vocabulary 
of blobs. Here, blobs are clusters of feature vectors that can be thought of as 
representing different “concepts”. An unannotated image is represented as a 
set of blobs and translated into a set of annotation words. The Cross-Media 
Relevance Model [3] (CMRM) also views the task as translation, but borrows
ideas from cross-lingual information retrieval [8], and thus allows for both image
annotation and retrieval. The Correspondence LDA Model [2] (CLDA) allows 
annotation and retrieval. It is based on Latent Dirichlet Allocation [9] and is a 
generative model that assumes a low dimensional set of latent factors generate, 
via a mixture model, both image regions and annotations. 

The motivation for the estimation techniques presented here is the Continuous
Relevance Model [4] (CRM), which is a continuous version of the CMRM
model that performs favorably. Unlike other models that impose a structure
on the underlying feature space via the use of blobs, the CRM model uses a 
non-parametric technique to estimate the joint probability of a query and an 
image. However, it assumes annotation terms are drawn from a multinomial dis- 
tribution, which may be a poor assumption. Our estimation technique makes 
no assumption as to the distribution of annotation terms and thus is fully non- 
parametric. 

The Inference Network retrieval framework is a robust model from the field of 
information retrieval [5] based on the formalism of Bayesian networks [10]. Some 
strong points of the model are that it allows structured, weighted queries to be 
evaluated, multiple document representations, and efficient inference. One par- 
ticular instantiation of the inference network framework is the InQuery retrieval 
system [11] that once powered Infoseek and currently powers the THOMAS 
search engine used by the Library of Congress. Additionally, inference networks 
for multimedia retrieval in extensible databases have been explored [12]. Experiments
have shown that intelligently constructed structured queries can translate
into improved retrieval performance. Therefore, the Inference Network frame- 
work is a good fit for image retrieval. 

3 Image Inference Network Model 

Suppose we are given a collection of annotated training images $T$. Each image $I \in T$
has a fixed set of words associated with it (its annotation) that describe the image
contents. These words are encoded in a vector $tf_I$ that contains the number
of times each word occurs in $I$’s annotation. We also assume that $I$ has been
automatically segmented into regions. Note that each image may be segmented
into a different number of regions. A fixed set of $d$ features is extracted from
each of these regions. Therefore, a $d$-dimensional feature vector $r_i$ is associated
with each region of $I$. Finally, each feature vector $r_i$ extracted from $I$ is assumed
to be associated with each word in $I$’s annotation. Therefore, $tf_I$ is assumed
to describe the contents of each $r_i$ in $I$. Images in the test set are represented
similarly, except they lack annotations. Given a set of unseen test images, the
image retrieval task is to return a list of images ranked by how well each matches
a user’s information need (query).



3.1 Model 

Our underlying retrieval model is based on the Inference Network framework [5].
Figure 1 (left) shows the layout of the network under consideration. The node $J$
is continuous-valued and represents the event that an image is described by a collection
of feature vectors. The $q_{w_j}$ nodes are binary and represent the event that
word $w_j$ is observed. Next, the $q_{r_k}$ nodes are binary and correspond to the event
that image representation (feature vector) $r_k$ is observed. Finally, the $q_{op}$ and
$I$ nodes represent query operator nodes. $I$ is simply a special query operator
that combines all pertinent evidence from the network into a single belief value
representing the user’s information need. These nodes combine evidence about
word and image representation nodes in a structured manner and allow efficient
marginalization over their parents [13]. Therefore, to perform ranked image retrieval,
for each image $X$ in the test collection we set $J = X$ as our observation,
run belief propagation, and rank documents via $P(I \mid X)$, the probability the
user’s information need is satisfied given that we observe image $X$.

Figure 1 (right) illustrates an example instantiation of the network for the
query #OR(#AND(tiger grass) <tiger.jpg>). This query seeks an image with
both tigers AND grass OR an image that is similar to tiger.jpg. The image
of the tiger appearing at the top is the image currently being scored. The other
nodes, including the cropped tiger image, are dynamically created based on
the structure and content of the query. Given estimates for all the conditional
probabilities within the network, inference is done, and the document is scored
based on the belief associated with the #OR node.
Fig. 1. Inference network layout (left) and example network instantiation (right)

Now that the inference network has been set up, we must specify how to 
compute $P(q_w \mid J)$, $P(q_r \mid J)$, and the probabilities at the query operator nodes.
Although we present specific methods for estimating the probabilities within 
the network, the underlying model is robust and allows any other imaginable 
estimation techniques to be used. 

3.2 Image Representation Node Estimation 

To estimate the probability of observing an image representation $r$ given an
image $J$ we use a density estimation technique. The resulting estimate gives
higher probabilities to representations that are “near” (or similar to), on average,
the feature vectors that represent image $J$. Thus, the following estimate is used:

$$P(q_r \mid J) = \frac{1}{|r_J|} \sum_{r_i \in J} \mathcal{N}(r; r_i, \Sigma)$$

where $|r_J|$ is the number of feature vectors associated with $J$ and

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d |\Sigma|}} \exp\left\{ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$

is a multivariate Gaussian kernel. Here, $\Sigma$ is assumed to be diagonal and is the
empirical variance with respect to the training set feature vectors. Note that any
appropriate kernel function can be used in place of $\mathcal{N}$.

3.3 Term Node Estimation

Estimating the likelihood a term is observed given an image, $P(q_w \mid J)$, is a more
difficult task since test images are not labeled with annotations. Inverting the
probability by Bayes’ rule and assuming feature vectors are conditionally independent
given $q_w$ we see that

$$P(q_w \mid J) \propto P(q_w) P(J \mid q_w) = P(q_w) \prod_{r_i \in J} P(r_i \mid q_w)$$

where $P(q_w) = n_{q_w} / n_{tot}$, $n_{q_w}$ is the number of feature vectors in the training set that $q_w$ is
associated with, and $n_{tot}$ is the total number of feature vectors in the training set.
To compute $P(r_i \mid q_w)$ we again use density estimation to estimate the density of
feature vectors among the training set that are associated with term $q_w$. Then,

$$P(r_i \mid q_w) = \frac{1}{n_{q_w}} \sum_{\substack{I \in T \\ tf_I(q_w) > 0}} \sum_{g_k \in I} \mathcal{N}(r_i; g_k, \Sigma)$$

where $tf_I(q_w)$ indicates the number of times that $q_w$ appears in image $I$’s annotation
and $\mathcal{N}$ is defined as above. Finally, it should be noted that when $\Sigma$ is
estimated from the data our model does not require hand tuning any parameters.
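The two kernel density estimates of Sections 3.2 and 3.3 can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: the function names, the representation of a training image as an `(annotation_counts, feature_vectors)` pair, and the use of plain Python lists are hypothetical choices, and the diagonal covariance is passed in as a list of per-dimension variances.

```python
import math


def gaussian_kernel(x, mu, var):
    """Multivariate Gaussian density with diagonal covariance `var`."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
    return math.exp(-0.5 * (d * math.log(2 * math.pi) + log_det + quad))


def p_representation(r, image_vectors, var):
    """P(q_r | J): average kernel value between r and the vectors of image J."""
    total = sum(gaussian_kernel(r, ri, var) for ri in image_vectors)
    return total / len(image_vectors)


def p_vector_given_term(r_i, training_images, term, var):
    """P(r_i | q_w): density of r_i among training vectors annotated with `term`.

    `training_images` is an assumed list of (annotation_counts, feature_vectors).
    """
    vectors = [g for counts, feats in training_images
               if counts.get(term, 0) > 0 for g in feats]
    if not vectors:
        return 0.0
    return sum(gaussian_kernel(r_i, g, var) for g in vectors) / len(vectors)


def term_score(image_vectors, training_images, term, var):
    """Unnormalized P(q_w | J) = P(q_w) * product over r_i of P(r_i | q_w)."""
    n_tot = sum(len(feats) for _, feats in training_images)
    n_qw = sum(len(feats) for counts, feats in training_images
               if counts.get(term, 0) > 0)
    score = n_qw / n_tot
    for r_i in image_vectors:
        score *= p_vector_given_term(r_i, training_images, term, var)
    return score
```

For images with many regions the product in `term_score` can underflow, so a practical variant would probably accumulate log-probabilities instead.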

3.4 Regularized Term Node Estimates 

As we will show in Section 4, the term node probability estimates used here 
result in good annotation performance. This indicates that our model estimates 
probabilities well for single terms. However, when combining probabilities from 
multiple term nodes we hypothesize that it is often the case that the ordering 
of the term likelihoods captures more important information than their actual 
values. That is, the fact that tiger $= \arg\max_{q_w} P(q_w \mid J)$ is more important than
the fact that $P(\text{tiger} \mid J) = 0.99$. Thus, we explore regularizing the term node
probabilities for each image so they vary more uniformly and are based on the 
ordering of the term likelihoods. 

We assume that the term likelihood ordering is important and that, for an image:
1) a few terms are very relevant (correct annotation terms), 2) a medium
number of terms are somewhat relevant (terms closely related to the annotation
terms), and 3) a large number of terms are irrelevant (all the rest). Following
these assumptions we fit the term probability estimates to a Zipfian distribution.
The result is a distribution where a large probability mass is given to relevant
terms, and a smaller mass to the less relevant terms. We assume that terms are
relevant based on their likelihood rank, which is defined as the rank of term $w$
in a sorted (descending) list of terms according to $P(q_w \mid J)$. Therefore, the most
likely term is given rank 1. For an image $J$ we regularize the corresponding term
probabilities according to

$$P(q_w \mid J) = \frac{1}{Z_J} \cdot \frac{1}{R_{q_w, J}}$$

where $R_{q_w, J}$ is the likelihood rank of term $w$ for image $J$ and $Z_J$ normalizes the distribution.
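The rank-based regularization can be illustrated with a short sketch. It assumes a Zipfian form in which the regularized probability is proportional to the reciprocal of the likelihood rank; the function name and the dictionary representation are hypothetical.

```python
def regularize_by_rank(term_likelihoods):
    """Replace term likelihoods by a Zipfian distribution over their ranks.

    `term_likelihoods` maps each term to its estimate of P(q_w | J).
    The most likely term receives rank 1 (assumed form: mass ~ 1 / rank).
    """
    ranked = sorted(term_likelihoods, key=term_likelihoods.get, reverse=True)
    z = sum(1.0 / rank for rank in range(1, len(ranked) + 1))
    return {term: (1.0 / rank) / z for rank, term in enumerate(ranked, start=1)}
```

For three terms the normalizer is 1 + 1/2 + 1/3 = 11/6, so the regularized probabilities are 6/11, 3/11 and 2/11 regardless of how peaked the original likelihoods were.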

3.5 Query Operators 

Query operators allow us to efficiently combine beliefs about query word nodes
and image representation nodes in a structured (logical) and/or weighted manner.
They are defined in such a way as to allow efficient marginalization over
their parent nodes [14], which results in fast inference within the network. The
following list represents a subset of the structured query operators available in
the InQuery system that have been implemented in our system: #AND, #WAND,
#OR, #NOT, #SUM, and #WSUM. To compute $P(q_{op} = \text{true} \mid J)$, the belief at query
node $q_{op}$, we use the following:


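$$
\begin{aligned}
P_{\#AND}(q_{op} \mid J) &= \prod_i p_i \\
P_{\#WAND}(q_{op} \mid J) &= \prod_i p_i^{w_i / W} \\
P_{\#OR}(q_{op} \mid J) &= 1 - \prod_i (1 - p_i) \\
P_{\#NOT}(q_{op} \mid J) &= 1 - p_1 \\
P_{\#SUM}(q_{op} \mid J) &= \frac{1}{n} \sum_i p_i \\
P_{\#WSUM}(q_{op} \mid J) &= \frac{1}{W} \sum_i w_i p_i
\end{aligned}
$$

where node $q_{op}$ has $n$ parents $\pi_1, \ldots, \pi_n$, $p_i = P(\pi_i \mid J)$, and $W = \sum_i w_i$. See [11,14,13]
for a derivation of these expressions, an explanation of these operators,
and details on other possible query operators.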
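As an illustration, the operator beliefs can be written as small functions over the parent beliefs $p_i$. This is a sketch based on the usual InQuery-style definitions (product for #AND, weighted geometric mean for #WAND, complement of the product of complements for #OR, mean for #SUM, normalized weighted mean for #WSUM); the function names are hypothetical and the normalization of #WSUM by the sum of weights is an assumption.

```python
import math


def op_and(p):
    """#AND: product of the parent beliefs."""
    return math.prod(p)


def op_wand(p, w):
    """#WAND: weighted geometric mean of the parent beliefs."""
    total = sum(w)
    return math.prod(pi ** (wi / total) for pi, wi in zip(p, w))


def op_or(p):
    """#OR: one minus the product of the complements."""
    return 1.0 - math.prod(1.0 - pi for pi in p)


def op_not(p1):
    """#NOT: complement of the single parent belief."""
    return 1.0 - p1


def op_sum(p):
    """#SUM: arithmetic mean of the parent beliefs."""
    return sum(p) / len(p)


def op_wsum(p, w):
    """#WSUM: weighted mean of the parent beliefs (assumed normalized by sum of w)."""
    return sum(wi * pi for pi, wi in zip(p, w)) / sum(w)
```

With illustrative beliefs of 0.8 for tiger, 0.5 for grass and 0.3 for the image node, a query of the form #OR(#AND(tiger grass) <tiger.jpg>) would be scored as `op_or([op_and([0.8, 0.5]), 0.3])`, i.e. 1 - (1 - 0.4)(1 - 0.3) = 0.58.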

4 Results 

We tested our system using the Corel data set that consists of 5000 annotated 
images. Each image is annotated with 1-5 words. The number of distinct words 
appearing in annotations is 374. The set is split into 4500 training images and 500 
test images. Each image is segmented using normalized cuts into 1-10 regions. 
A standard set of 36 features based on color and texture is extracted from each 
image region. See [7] for more details on the features used. We compare the 
results of our system with those reported from the Translation, CMRM and
CRM models that used the same segmentation and image features. Throughout 
the results, InfNet-reg refers to the model with regularized versions of the term 
probabilities and InfNet refers to it with unregularized probabilities. 



4.1 Image Annotation 

The first task we evaluate is image annotation. Image annotation results allow us 
to compare how well different methods estimate term probabilities. The goal is to 
annotate unseen test images with the 5 words that best describe the image. Our 
system annotates these words based on the 5 terms with the highest likelihood 
rank for each image. Mean per-word recall and precision are calculated, where 
recall is the number of images correctly annotated with a word divided by the 
number of images that contain that word in the human annotation, and precision 
is the number of images correctly annotated with a word divided by the total 
number of images annotated with that word in the test set. These metrics are
computed for every word, and the means over all words are reported in Table 1.
As the table shows, our system achieves very good performance on the annotation
task. It outperforms CRM both in terms of mean per-word precision and recall, with
the mean per-word recall showing a 26.3% improvement over the CRM model.
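The per-word metrics defined above can be computed as in the following sketch. The function name and the data layout (dictionaries mapping an image identifier to a set of words) are hypothetical, and treating a word with an empty denominator as contributing 0 is an assumption not stated in the text.

```python
def per_word_recall_precision(truth, predicted, vocabulary):
    """Return (mean per-word recall, mean per-word precision).

    `truth` and `predicted` map an image id to a set of words.
    """
    recalls, precisions = [], []
    for word in vocabulary:
        correct = sum(1 for img in truth
                      if word in truth[img] and word in predicted.get(img, set()))
        in_truth = sum(1 for img in truth if word in truth[img])
        in_pred = sum(1 for img in predicted if word in predicted[img])
        # Assumption: a word never present (or never predicted) scores 0.
        recalls.append(correct / in_truth if in_truth else 0.0)
        precisions.append(correct / in_pred if in_pred else 0.0)
    n = len(recalls)
    return sum(recalls) / n, sum(precisions) / n
```

The relative improvement quoted in the text follows from the rounded values in Table 1: (0.24 - 0.19) / 0.19 ≈ 0.263.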
Table 1. Annotation results

Models                      Translation   CMRM   CRM    InfNet
# words w/ recall > 0       49            66     107    112

Results on full vocabulary
Mean per-word recall        0.04          0.09   0.19   0.24
Mean per-word precision     0.06          0.10   0.16   0.17



4.2 Image Retrieval 

For the retrieval task we create all 1-, 2-, and 3-word queries that are relevant
to at least 2 test images. An image is assumed to be relevant if and only if its
annotation contains every word in the query. Then, for a query $Q = q_1, q_2, \ldots, q_L$
we form and evaluate a query of the form #and($q_1, \ldots, q_L$). We use the standard
information retrieval metrics of mean average precision (MAP) and precision at
5 ranked documents to evaluate our system. Table 2 reports the results.
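The relevance criterion and the two metrics can be sketched as follows. The function names are hypothetical, and the average precision shown is the standard non-interpolated definition, which is assumed to be the one used here; MAP is then the mean of `average_precision` over all queries.

```python
def is_relevant(annotation, query_words):
    """An image is relevant iff its annotation contains every query word."""
    return set(query_words) <= set(annotation)


def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k ranked images that are relevant."""
    return sum(1 for img in ranked[:k] if img in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank holding a relevant image."""
    hits, total = 0, 0.0
    for rank, img in enumerate(ranked, start=1):
        if img in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```

For a ranking `["a", "b", "c", "d", "e"]` with relevant set `{"a", "c"}`, precision at 5 is 2/5 and average precision is (1/1 + 2/3) / 2 = 5/6.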



Table 2. Retrieval results and comparison

Query length          1 word    2 word    3 word
Number of queries     179       386       178
Relevant images       1675      1647      542

Precision after 5 retrieved images
CMRM                  0.1989    0.1306    0.1494
CRM                   0.2480    0.1902    0.1888
InfNet                0.2525    0.1672    0.1727
InfNet-reg            0.2547    0.1964    0.2170

Mean Average Precision
CMRM                  0.1697    0.1642    0.2030
CRM                   0.2353    0.2534    0.3152
InfNet                0.2484    0.2155    0.2478
InfNet-reg            0.2633    0.2649    0.3238




The regularized probabilities result in much better retrieval performance than
the unregularized probabilities. Using these probabilities, our system achieves
better performance than both the CMRM and CRM models on all 3 query sets
for both the MAP and precision after 5 retrieved documents metrics.

Figure 2 gives illustrative examples of the top 4 ranked documents using the 
regularized estimates for several structured queries. The first query of Figure 2, 
#or(swimmers jet), results in images of both swimmers and jets being returned.
The next query shows that a standard query gives good retrieval results. The 
next two queries demonstrate how term weighting can affect the retrieval re- 
sults. Finally, the last query shows an example query that mixes text and image 
representations, with the bird image being part of the query. These results show 
that the structured operators can be used to form rich, powerful queries that 
other approaches are not capable of. 
Fig. 2. Example structured query results



5 Conclusions and Future Work 

We have presented an image retrieval system based on the inference network 
framework from information retrieval. The resulting model allows rich queries 
with structured operators and term weights to be evaluated for combinations 
of terms and images. We also presented novel non-parametric methods for esti- 
mating the probability of a term given an image and a method for regularizing 
the probabilities. Our system performs well compared to other published models 
under standard experimental evaluation. 

There are a number of things that can be done as part of future work. First, 
better estimates for P(q r \J) are needed. The method described in this paper 
was used for simplicity. Second, our system must be tested using different seg- 
mentation and better features to allow comparison against other published re- 
sults. Third, more rigorous experiments should be done using the structured and 
weighted query operators to show empirically what effect they have on overall
performance. Finally, it would be interesting to explore a model that combines 
the current system with a document retrieval system to allow for full text and 
image search in a combined model. 

Acknowledgments. We thank Kobus Barnard for making the Corel dataset 
available at http://vision.cs.arizona.edu/kobus/research/data/eccv_2002/. 
Abstract. We evaluate the small sample size (SSS) performance of evolutionary
algorithms (EAs) for relevance feedback (RF) in image retrieval. We focus on the
requirement to learn the user’s information need based on a small number (between
2 and 25) of positive and negative training images. Despite this being a fundamental
requirement, none of the existing works dealing with EAs for RF systematically
evaluates their SSS performance. To address this issue, we compare four variants
of EAs for RF. Common for all variants is the hierarchical, region-based
image similarity model, with region and feature weights as parameters.
The difference between the variants is in the objective function of the
EA used to adjust the model parameters. The objective functions include:
(O-1) precision; (O-2) average rank; (O-3) ratio of within-class
(i.e., positive images) and between-class (i.e., positive and negative images)
scatter; and (O-4) combination of O-2 and O-3. We note that, unlike O-1 and O-2,
O-3 and O-4 are not used in any of the existing works dealing with EAs for RF.
The four variants are evaluated on five test databases, containing 61,895
general-purpose images, in 619 semantic categories. Results of the evaluation
reveal that variants with objective functions O-3 and O-4 consistently outperform
those with O-1 and O-2. Furthermore, comparison with the representative of the existing
RF methods shows that EAs are both effective and efficient approaches
for SSS learning in region-based image retrieval.



1 Introduction 

Modeling image similarity is one of the most important issues in the present 
image retrieval research [12,15]. When asked to retrieve the database images 
similar to the user’s query image(s), the retrieval system must approximate the 
user’s similarity criteria, in order to identify the images which satisfy the user’s 
information need. 

Given that the user’s similarity criteria are both subjective and context-dependent
[12,17], the adaptation of the retrieval system to the user through
relevance feedback (RF) is crucial [17]. The objective of RF is to improve
the retrieval performance by learning the user’s information need based on the 
interaction with the user. RF is “shown to provide dramatic performance boost in 
[image] retrieval systems,” and even “it appears to have attracted more attention 
in the new field [image retrieval] than in the previous [text retrieval].” [17] 

Among a variety of machine learning and pattern recognition techniques 
applied to RF in information retrieval in general [17,6], probably the greatest 
imbalance between text and image retrieval is in the applications of evolutionary 
algorithms (EAs). Namely, while EAs are widely used and proven to be both 
effective and efficient approaches to RF in text retrieval [3,6], their applications 
to RF in image retrieval are still rare [17]. 

Our objective in this paper is to evaluate the performance of EAs for RF 
in image retrieval. We compare four variants of EAs for RF: two existing
ones and two new ones that we introduce. We particularly focus on the
small sample size (SSS) performance, i.e., the requirement for a retrieval system
to learn the user’s information need based on a small number (up to a few tens)
of positive (i.e., relevant for the user’s information need) and negative
(i.e., irrelevant) training images. Despite this being a fundamental requirement 
in the RF context [17], none of the existing works dealing with EAs for RF 
systematically evaluates their SSS performance. 

The four variants of EAs for RF are evaluated on five test databases con- 
taining 61,895 general-purpose images, in 619 semantic categories. In total, over 
1,000,000 queries are executed, based on which the (weighted) precision, average 
rank, and retrieval time are computed. The comparison with the representative 
of the existing RF methods is performed as well. 

The main contributions of the present work are: (1) from the viewpoint of 
EAs for RF in image retrieval: (a) we perform the first systematic evaluation 
of their SSS performance; (b) we introduce two new EA variants, shown to 
outperform the existing ones; (2) from the viewpoint of RF in image retrieval in 
general: through the comparison with the existing RF methods, we demonstrate 
that EAs are both effective and efficient approaches for SSS learning in region- 
based image retrieval. 

The rest of the paper is structured as follows. In Section 2, related works
dealing with RF and EAs in image retrieval are surveyed. Next, in Section 3,
the four variants of EAs for RF are introduced. Finally, in Section 4, the
evaluation of the SSS performance of the four variants is described.

2 Related Works: RF and EAs in Image Retrieval 

Evolutionary algorithms. EAs refer to a class of stochastic, population-based 
search algorithms [9,5], including: genetic algorithms (GAs), evolution strategies 
(ESs), evolutionary programming (EP), etc. 

Regarding their desirable characteristics, from the viewpoint of application
to learning and/or optimization tasks, EAs are [9,5]: (1) global search algorithms,
not using the gradient information about the objective function, which
makes them applicable to nondifferentiable, discontinuous, nonlinear, and multimodal
functions; (2) robust in that their performance does not critically depend
on the control parameters; (3) easy to implement, since the basic mechanism of
an EA consists of a small number of simple steps; (4) domain-independent, i.e.,
“weak,” in that they do not use the domain knowledge; and (5) flexible in that
they can equally handle both continuous and discrete variables to be optimized.

Regarding the less desirable characteristics, EAs are in general: (1) not guaranteed
to find a globally optimal solution to a problem; and (2) computationally
intensive. However, in practice [9], the sub-optimal solutions found by EAs are
shown to be “sufficiently good” in most cases, while the parallel implementation,
natural for EAs, can solve the computation time problem.

EAs in text and image retrieval. Among a variety of techniques applied to 
RF in information retrieval in general, probably the greatest imbalance between 
text and image retrieval is in the applications of EAs. 

On one hand, applications of EAs to RF in text retrieval are numerous:
they are even covered in a well-known and widely used information retrieval
textbook [6], and are also the subject of a recent comprehensive survey containing
57 references [3]. On the other hand, works dealing with EAs for RF in image
retrieval are still rare [2,13]; none is mentioned among the 64 references in a
recent comprehensive survey of RF in image retrieval [17].

The issue not sufficiently addressed in the works dealing with EAs for RF in
image retrieval is the SSS, despite this being a fundamental requirement in the
RF context. As the objective function, computed over the training images,
either the retrieval precision [13] or the average rank of the positive images [2]
is used, neither of which is suitable for SSS learning. Furthermore, no work
systematically evaluates the SSS performance of EAs for RF in image retrieval.

Distinguishing characteristics of EAs for RF. In the context of RF in image
retrieval, the distinguishing characteristics of EAs are: (1) unlike the majority of
other RF techniques, which perform the “transformation of the feature space” [17],
EAs perform the “modification of the similarity model”; (2) arbitrary nonlinearities
or complex mathematical operators can be easily incorporated in the
similarity model, without changing the EA used to adjust the model parameters;
and (3) the learning algorithm is inherently incremental, suitable for long-term
learning (i.e., creating user profiles) as well.

Furthermore, the computation time of EAs is not a problem, since the number 
of training samples is small — as we experimentally demonstrate in Section 4. 

3 Four Variants of Evolutionary Algorithms for Relevance 
Feedback 

We introduce four variants of EAs for RF in image retrieval. Common for all 
variants is the hierarchical, region-based image similarity model, with region and 
feature weights as parameters (Section 3.1). The difference between the variants 
is in the objective function of the EA used to adjust the model parameters
(Section 3.2), based on the positive and negative training images.



3.1 Hierarchical Region-Based Image Similarity Model 

In the proposed image similarity model, image similarity is expressed as a 
weighted arithmetic mean of region similarities, while each region similarity is 
expressed as a weighted arithmetic mean of the corresponding feature similari- 
ties. Variables and functions used for the formalization of the proposed model 
are explained in the following. 

Set of images $I$ represents the image database. The area of each image is
uniformly partitioned into $N \times N$ $(= n_R)$ regions, i.e., rectangular blocks of
equal size, defined by the set of regions $R$. From each region, $n_F$ features are
extracted, defined by the set of features $F$. Based on the extracted features
(e.g., color or texture), the similarity of a pair of image regions is computed.
A collection of (feature or region) similarity values is mapped into a single,
overall (region or image) similarity value, using the weighted arithmetic mean.

The hierarchical image similarity computation is expressed by the following
equation ($q$ is the query, $i$ is the database image, $q, i \in I$):



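[Equation: image similarity as a weighted arithmetic mean of region similarities; the inner weighted mean of feature similarities is labeled “region similarity”]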




Regarding the number of regions (set $R$) into which each image is partitioned,
in the experiments we used $N = 4$, i.e., $n_R = N \times N = 16$ regions (Section 4).
While the choice is heuristic, it results in a satisfactory retrieval performance, and
other region-based image similarity models use a similar number of regions [15,13].

Regarding the image features (set F) extracted from each region, based on
which feature similarity values S_F(q, i; r; f) are computed, we have chosen three
features most commonly used in image retrieval [8]: color, shape, and texture.
Three features imply that n_F = 3. Color features are represented by color
moments [14], resulting in a 9-D feature vector. Shape features are represented by
an edge-direction histogram [1], resulting in an 8-D feature vector. Texture features
are represented by texture neighborhood [7], resulting in an 8-D feature vector. In
total, each image region is represented by a 9-D + 8-D + 8-D = 25-D feature vector.
The feature similarity values S_F(q, i; r; f) are inversely proportional to the
distance between the corresponding feature vectors, which is computed using the
(weighted) city-block distance [8] for all three features.
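
The hierarchical computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping from distance d to similarity, 1 / (1 + d), is a hypothetical choice (the text only says similarity is inversely related to distance), a single feature-weight vector is shared by all regions, and the function names are not from the paper.

```python
def city_block(u, v):
    """City-block (L1) distance between two equal-length feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def weighted_mean(values, weights):
    """Weighted arithmetic mean; the weights need not be normalized."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def image_similarity(query, image, region_weights, feature_weights):
    """Two-level weighted mean: features within a region, then regions.

    query, image: lists of regions; each region is a list of feature vectors.
    Assumption: a distance d is turned into a similarity as 1 / (1 + d).
    """
    region_sims = []
    for q_region, x_region in zip(query, image):
        feature_sims = [1.0 / (1.0 + city_block(qf, xf))
                        for qf, xf in zip(q_region, x_region)]
        region_sims.append(weighted_mean(feature_sims, feature_weights))
    return weighted_mean(region_sims, region_weights)
```

With 16 regions and 3 features per region, as in the experiments, `region_weights` would have 16 entries and `feature_weights` 3; these are the parameters the EA adjusts.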



3.2 Four Variants of EA for Adjusting Model Parameters 

The four EA variants are used to adjust the region and feature weights of the
proposed similarity model, based on the positive and negative training images
provided by the user through the interaction. The difference between the four
variants is in the objective function of the EA. The EA itself is the same for all
variants: the evolution strategy (ES), with a modification we introduce.

Evolutionary algorithm. We choose the ES among EAs for two reasons:
(1) ESs are particularly suitable for the optimization of real-valued parameters
(in our case, weights) [9]; and (2) ESs are the simplest and the easiest to
implement among the EAs.

We employ a two-membered evolution strategy — (1+1)ES — which is the 
basic ES. The underlying idea is to: (1) randomly generate an initial solution; 
and (2) iteratively generate new solutions by applying the stochastic mutation 
operator to the current solution. Whenever a new solution is generated, it
competes with the current one, and replaces it if it is better as evaluated
by the objective function. The process continues for a predefined number of
iterations, or until a sufficiently good solution is found. 

We modify the original ES by combining two types of mutation — uniform 
random mutation and Gaussian random mutation. In general, the uniform mu- 
tation is more suitable for global search, while the Gaussian mutation is more 
suitable for local search [9]. To additionally emphasize these characteristics we: 
(1) set the mutation rate of the uniform mutation to a high value, and discretize 
the weight space on which the uniform mutation operates; and (2) set the mu- 
tation rate of the Gaussian mutation to a low value, while using the original, 
continuous weight space. Whenever the uniform mutation — performing global 
search — generates a new solution which is better than the current one, the 
Gaussian mutation is iteratively applied to the new solution, to perform local 
search, i.e., “fine tuning.” In this way, the global and local search are more
effectively combined than in the case of the original ES.
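
A minimal sketch of the modified (1+1)-ES follows. The mutation rates, the number of grid levels, the Gaussian step size and the iteration counts are all hypothetical values chosen for illustration, and the objective is assumed to be maximized over weights in [0, 1]; none of these settings come from the paper.

```python
import random


def uniform_mutation(weights, rate, levels, rng):
    """Global search: with probability `rate`, reset a weight to a random
    point of a discretized [0, 1] grid with `levels` steps."""
    return [rng.randrange(levels + 1) / levels if rng.random() < rate else w
            for w in weights]


def gaussian_mutation(weights, rate, sigma, rng):
    """Local search: with probability `rate`, perturb a weight by Gaussian
    noise and clip the result to [0, 1]."""
    return [min(1.0, max(0.0, w + rng.gauss(0.0, sigma)))
            if rng.random() < rate else w
            for w in weights]


def one_plus_one_es(objective, n_weights, iterations=200, local_steps=20, seed=0):
    """(1+1)-ES: one parent, one offspring; the offspring replaces the
    parent only if it scores better (objective is maximized)."""
    rng = random.Random(seed)
    current = [rng.random() for _ in range(n_weights)]
    best = objective(current)
    for _ in range(iterations):
        candidate = uniform_mutation(current, 0.5, 10, rng)   # high rate, coarse grid
        score = objective(candidate)
        if score > best:
            current, best = candidate, score
            # A successful global step triggers Gaussian "fine tuning".
            for _ in range(local_steps):
                candidate = gaussian_mutation(current, 0.2, 0.05, rng)  # low rate
                score = objective(candidate)
                if score > best:
                    current, best = candidate, score
    return current, best
```

Because a candidate is accepted only when it improves the objective, the returned score can never be worse than that of the random initial solution.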

Objective functions. As the objective functions, computed over the training
images, we use: (O-1) precision; (O-2) average rank; (O-3) ratio of positive
and negative scatter; and (O-4) combination of O-2 and O-3.

Within each objective function, the training images are ranked based on the
similarity to the centroid of positive images, using the image similarity function
(Equation 1). Given n_P positive and n_N negative training images, precision is
computed as the fraction of positive images among the top-ranked n_P images.
Average rank is simply the average of the rank values for positive images. The
positive and negative scatter are defined as the average similarity (computed
using Equation 1) between the centroid of positive images and (1) each of the
positive images, and (2) each of the negative images, respectively.
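
The three basic quantities can be sketched as follows, taking as input the similarities of the positive and negative training images to the positive centroid. The function names are illustrative, the ratio is assumed to be positive scatter divided by negative scatter, and the negative scatter is assumed to be nonzero.

```python
def rank_training_images(pos_sims, neg_sims):
    """Labels (True = positive) sorted by decreasing similarity to the
    centroid of the positive images."""
    scored = [(s, True) for s in pos_sims] + [(s, False) for s in neg_sims]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [is_pos for _, is_pos in scored]


def precision(pos_sims, neg_sims):
    """O-1: fraction of positives among the top-ranked n_P images."""
    n_p = len(pos_sims)
    return sum(rank_training_images(pos_sims, neg_sims)[:n_p]) / n_p


def average_rank(pos_sims, neg_sims):
    """O-2: mean rank (1 = best) of the positive images; lower is better."""
    ranked = rank_training_images(pos_sims, neg_sims)
    ranks = [r for r, is_pos in enumerate(ranked, start=1) if is_pos]
    return sum(ranks) / len(ranks)


def scatter_ratio(pos_sims, neg_sims):
    """O-3 (assumed direction): average positive similarity divided by
    average negative similarity; higher is better."""
    positive_scatter = sum(pos_sims) / len(pos_sims)
    negative_scatter = sum(neg_sims) / len(neg_sims)
    return positive_scatter / negative_scatter
```

Note that precision only changes when an image crosses the n_P cut-off, whereas average rank reacts to any swap between a positive and a negative image, which is why O-2 is the more sensitive of the two.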

Functions O-1 [13] and O-2 [2] are used in the existing works dealing with
EAs for RF. It follows from the definition of the two functions that O-2 is more
sensitive to a change in ranking among the training images, and is thus more
suitable for the SSS learning than O-1. On the other hand, the newly introduced
function O-3 is inspired by the Biased Discriminant Analysis (BDA) [16], and is
expected to be more suitable for the SSS learning than O-2.

4 Experimental Evaluation of Small Sample Size Retrieval Performance

4.1 Experimental Setting 

Test databases. Five test databases are used, containing 61,895 images, in 619 
semantic categories: (1) Corel-1000-A database [15], with 1,000 images in 10 
categories; (2) Corel-1000-B database [13], with 1,000 images in 10 categories; 
(3) Corel-20k-A database, with 20,000 images in 200 categories; (4) Corel- 
20k-B database, with 19,998 images in 200 categories; and (5) Corel-20k-C 
database, with 19,897 images in 199 categories. The number of images per 
category varies between 97 and 100 for the five databases. Corel-20k-A, Corel- 
20k-B, and Corel-20k-C databases are obtained by partitioning the Corel-60k 
database [15] into three approximately equal subsets. All the five databases are 
subsets of the Corel image collection [4], and contain color photographs, ranging 
from natural scenes to artificial objects. The Corel collection is well-known and 
frequently used for the evaluation of the image retrieval methods [15,13]. 

Partitioning of each database into semantic categories is determined by the 
creators of the database, and reflects the human perception of image similarity. 
The semantic categories define the ground truth in a way that, for a given 
query image, the relevant images are considered to be those — and only those 
— that belong to the same category as the query image. 

Performance measures. The performance measures we use are: (1) precision, 
(2) weighted precision, (3) average rank, and (4) retrieval time. These are the 
four most frequently used measures of the image retrieval performance [12]. All 
the four performance measures are computed for each query, based on the given 
ground truth. The performance measures are averaged for each image category, 
as well as for each test database. 

Queries. A query consists of a set of positive and negative training images. 
In the basic test cases, the number of positive (P) and negative (N) training 
images is varied between 1P+1N and 13P+12N, in 24 steps: 1P+1N, 2P+1N, 
2P+2N, 3P+2N, . . . , 13P+12N. Accordingly, there are 24 basic test cases. 

Each basic test case is executed 20 times for each image category, in each 
test database. Each time, a different set of positive and negative training images 
is randomly chosen, with positive images being from the given category, and 
negative from other categories. This gives (#categories × #basic test cases ×
20 trials) queries per test database, i.e., 619 × 24 × 20 = 297,120 queries in total.
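
As a quick check of the arithmetic, the sequence of basic test cases can be enumerated directly. The helper below is an illustrative sketch and its name is not from the paper.

```python
def basic_test_cases(max_positive=13, max_negative=12):
    """Enumerate (positive, negative) counts: 1P+1N, 2P+1N, 2P+2N, 3P+2N, ...

    Starting from (1, 1), a positive image is added when the counts are
    equal and a negative image otherwise.
    """
    cases = []
    p, n = 1, 1
    while p <= max_positive and n <= max_negative:
        cases.append((p, n))
        if p == n:
            p += 1
        else:
            n += 1
    return cases


cases = basic_test_cases()
total_queries = 619 * len(cases) * 20   # categories x test cases x trials
```

The enumeration yields 24 cases, ending at 13P+12N, so `total_queries` is 619 × 24 × 20 = 297,120.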

In the additional test cases, the number of positive (i.e., negative) training 
images is fixed to 1, 2, and 3, respectively, while the number of negative (i.e., 
positive) training images is varied between 1 and 10, i.e.: 1P+1N, 1P+2N, . . . , 
1P+10N, 2P+1N, 2P+2N, . . . , 2P+10N, 3P+1N, 3P+2N, . . . , 3P+10N. This 
gives 60 additional test cases — 30 with small number of positive and increasing 
number of negative images, and 30 for the opposite situation. As for the basic 
test cases, each additional test case is executed 20 times for each image category 
in each test database, resulting in 619 × 60 × 20 = 742,800 queries in total.

Comparison with existing RF methods. Besides comparing the retrieval
performance of the four EA variants with objective functions O-1 to O-4, we also
compare the best-performing of the four variants with two of the existing RF
methods: the Query Point Movement (QPM) method [10] and the Standard
Deviation-Based (StDev) method [11]. The two methods are chosen since they are
easy to implement, and are shown to be both efficient and effective [10,11]. The
image features used in the two methods are the same as those used in the proposed
method (Section 3.1).

4.2 Experiment Results and Discussion 

Experiment results are summarized in Figure 1.

(a) Overall comparison of EA variants

Perform.          Objective Function
Measure      O-1      O-2      O-3      O-4
P. [%]      23.86    23.96    24.79    24.98
W.P. [%]    30.45    30.60    31.96    32.15
A.R.         3373     3346     3318     3298

(b) Best variant vs. existing methods
[Table of average retrieval precision for O-4, QPM and StDev on the five test
databases; only fragments of the values survive: 49.01, 42.72, 43.51, 6.13,
7.44, 9.71.]

(c) Evaluation on Corel-20k-B DB
[Plot; horizontal axis: # training images (positive + negative).]

(d) Evaluation on Corel-20k-A DB
[Plot; horizontal axis: # training images (positive + negative).]


Fig. 1. (a) Comparison of the four EA variants, with objective functions O-1 to O-4. For
each variant, average values of retrieval precision (P.), weighted precision (W.P.), and
average rank (A.R.) are given. Each value is an average of 297,120 queries, corresponding
to the 24 basic test cases evaluated over the five test databases. (b) Comparison of
the best-performing of the four EA variants (O-4) with the QPM and StDev methods.
For each method, the average retrieval precision is given, evaluated on each of the five
test databases. (c) Comparison of the four EA variants on Corel-20k-B database. For
each variant, average retrieval precision for the 24 basic test cases is given. Each value
is an average of 4,000 queries (200 categories × 20 trials per category). (d) Comparison
of the four EA variants on Corel-20k-A database. For each variant, average retrieval
precision for 10 of the additional test cases is given. Each value is an average of
4,000 queries (200 categories × 20 trials per category).

The average retrieval time per image for the four EA variants is in the range
[0.1 sec, 0.8 sec], depending on the number of training images, evaluated on a
750 MHz Pentium III processor.

Based on the experiment results, the following observations can be made: 

1. EA variants with objective functions O-3 and O-4 consistently outperform
the variants with objective functions O-1 and O-2 (Figure 1-a, c, d). This
demonstrates that the newly introduced objective functions (O-3 and O-4)
are more suitable for the SSS learning than the existing ones (O-1 and O-2).

2. The objective function O-4, which combines objective functions O-2 and
O-3, results in the best performing EA variant. In general, EAs allow for
straightforward combination of the objective functions.

3. The best performing EA variant (O-4) consistently outperforms the
representatives of the existing relevance feedback methods on all test databases
(Figure 1-b). This is a consequence of both: (1) the proposed image similar- 
ity model, with the region and feature weights as parameters — as opposed 
to QPM and StDev methods with only the feature weights as parameters; 
and (2) the EA used to adjust the model parameters. 

4. Increasing the number of training images does not necessarily improve the
retrieval performance (Figure 1-c, d), and general conclusions regarding the
effect of the number of training images on the retrieval performance are
difficult to draw.

5 Conclusion 

We evaluated the small sample size (SSS) performance of evolutionary algorithms 
(EAs) for relevance feedback (RF) in image retrieval. 

We compared four variants of EAs for RF — two of the existing ones, and the 
two new ones we introduced. Common for all variants is the hierarchical, region- 
based image similarity model, with region and feature weights as parameters. 
The difference between the variants is in the objective function of the EA used 
to adjust the model parameters. The objective functions included: (O-1)
precision; (O-2) average rank; (O-3) ratio of within-class (i.e., positive images) and
between-class (i.e., positive and negative images) scatter; and (O-4) combination
of O-2 and O-3. We note that, unlike O-1 and O-2, O-3 and O-4 are not
used in any of the existing works dealing with EAs for RF.

The four variants of EAs for RF were evaluated on five test databases contain- 
ing 61,895 general-purpose images, in 619 semantic categories. The comparison 
with the representative of the existing RF methods was performed as well. 

The main contributions of the present work are: (1) from the viewpoint of 
EAs for RF in image retrieval: (a) we performed the first systematic evaluation 
of their SSS performance; (b) we introduced two new EA variants, shown to 
outperform the existing ones; (2) from the viewpoint of RF in image retrieval in 
general: through the comparison with the existing RF methods, we demonstrated 
that EAs are both effective and efficient approaches for SSS learning in region-
based image retrieval.

Abstract. Video retrieval compares multimedia queries to a video
collection in multiple dimensions and combines all the retrieval scores into
a final ranking. Although text is the most reliable feature for video
retrieval, features from other modalities can provide complementary
information. This paper presents a reranking framework for video retrieval
to augment retrieval based on text features with other evidence. We
also propose a boosted reranking algorithm called Co-Retrieval, which
combines a boosting type algorithm and a noisy label prediction scheme
to automatically select the most useful weak hypotheses for different
queries. The proposed approach is evaluated with queries and video from
the 65-hour test collection of the 2003 NIST TRECVID evaluation.¹



1 Introduction 

The task of content-based video retrieval is to search a large amount of video for
clips relevant to an information need expressed in the form of multimodal queries.
The queries may consist merely of text or also contain images, audio or video
clips that must be matched against the video clips in the collection. Specifically,
this paper focuses on content-based queries which attempt to find semantic
contents in the video collection such as specific people, objects and events. To find
relevant clips for content-based queries, our video retrieval system needs to go 
through the following steps as indicated in Figure 1. First, various sets of index 
features are extracted from the video library through analysis of multimedia 
sources. Each video clip (or shot) is then associated with a vector of individual 
retrieval scores (or ranking features) from different search modules, indicating 
the similarity of this clip to a specific aspect of the query. Finally, these individual 
retrieval scores are fused via a weighted linear aggregation algorithm to produce 
an overall ranked list of video clips. 

It is of great interest to study the combination methods in the final step. 
This work considers approaches which rerank the retrieval output originally ob- 
tained from text features, using additional weak hypotheses generated from other 

1 This research is partially supported by Advanced Research and Development Activ- 
ity (ARDA) under contract number MDA908-00-C-0037 and MDA904-02-C-0451. 



P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 60-69, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004




Co-retrieval: A Boosted Reranking Approach for Video Retrieval 






[Figure: a multimodal query with text, audio, motion and image aspects is
matched against the script, video OCR, production metadata and audio indexes
produced by multiple-modality video collection analysis; the results are
combined by weighted linear fusion.]

Fig. 1. Overview of our video retrieval system



modalities as evidence. Text retrieval first finds a set of relevant shots for each 
query, with associated scores that define an initial ranking. The selected weak 
hypotheses are then weighted and linearly combined in an attempt to improve 
upon the initial ranking. 

Merely combining weak hypotheses with fixed weights is inflexible, and asking
users to explicitly set weights is unrealistic. It is desirable for the system
to learn the linear weights automatically. Put another way, the system should
be able to pick out the most related concepts without feedback from users. For
example, for the query “finding people on the beach” the ideal system can choose
outdoors and people features to improve the initial retrieval. To achieve this, we
apply a boosting type algorithm, which repeatedly learns weak classifiers from
a reweighted distribution of the training data and combines the weak classifiers
into one composite classifier. We also provide a noisy label prediction scheme
which allows it to improve the initial ranking without any external training
data. Experiments applied the proposed approach to a video test collection of
over 65 hours from the 2003 content-based video retrieval track [1].

2 A Reranking Framework for Video Retrieval 

Based on evidence from the best-performing video retrieval systems in the 2001
and 2002 NIST TREC Video Retrieval evaluation tasks, text features are
demonstrated to be the most reliable ranking features for selecting semantically
relevant shots in video retrieval. Text features span several dimensions including
automatic speech recognition (ASR), closed captions (CC), video optical character
recognition (VOCR) and production metadata such as published descriptions
of the video. Typically a number of nearby shots are also retrieved, since temporal
proximity can to some extent indicate content closeness. However, ranking of video
shots cannot simply rely on these text features. One important reason is that
words may be spoken in the transcript when no associated images are present,
e.g. a news anchor might discuss a topic for which no video clips are available. A
reporter may also speak about a topic with the relevant footage following later
in the story, resulting in a major time offset between the word and the relevant
video clips. As shown in Figure 2(a), text retrieval will at times assign high scores
to shots of studio settings or anchors, which is usually not desirable. Moreover,
word sense ambiguity may result in videos being retrieved for the wrong meaning,
e.g. either a river shore or a financial institution is possible for the word ‘bank’.
Speech recognition errors or VOCR errors may also result in incorrect retrieval.
In general, retrieval using text features exhibits satisfactory recall but relatively
low precision.

Fig. 2. The key frames of top 8 retrieved shots for query “Finding Tomb at Arlington
National Cemetery”. These are retrieval results based on (a) text features (b) image
features (c) text plus image features while filtering out news anchors and commercials

Fortunately, many complementary sources from various video modalities can
be used to rerank the text retrieval output. These sources include audio
features, motion vectors, visual features (e.g. color, edge and texture histograms),
and any number of pre-defined high-level semantic features (e.g. a face detector
and an anchor detector). Generally speaking, none of these can capture
the full content of the shots, and therefore retrieval results based only on these
features are mostly unsatisfying. Figure 2(b) depicts the top results of image-only
retrieval, which returns nothing related to the query. However, these weak
features can provide some indication of how closely the video shots are related 
to the given specific query examples. They can also filter out irrelevant shots 
such as anchorpersons or commercials. Figure 2(c) illustrates the advantage of 
weak features which can augment text retrieval by finding the similar objects 
and filtering news anchor plus commercial shots. 

These observations indicate that we should rely on text-based retrieval as the 
major source for answering semantic queries, while using the weak ranking func- 
tions from other modalities in combination to refine the ranking from the text 
retrieval. To be more general in the implementation, we convert the weak ranking
features into a set of [-1,1]-valued weak hypotheses. Therefore, we propose
the following re-ranking framework for video retrieval,

$$F(x_i, \lambda) = \lambda_0 F_0(x_i) + \sum_{t=1}^{m} \lambda_t h_t(x_i) \qquad (1)$$

where λ_t is the weight for the t-th ranking function, x_i is the i-th video shot, F_0(·)
is the base ranking function generated by text retrieval and h_t(·) is the output of
the t-th weak hypothesis. Without loss of generality, we assume F_0(x_i) = 1 when
shot x_i is found to be relevant by text retrieval, otherwise F_0(x_i) = 0. λ_0 is
typically set proportional to max_t(λ_t) after the λ_t are learned. This allows F_0 to
dominate the retrieval results while the other weaker hypotheses h_t re-rank and
adjust the output provided by F_0.
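
The re-ranking score can be illustrated with a short sketch. The function name, the example shots and the weight values are hypothetical; the base weight is simply passed in, since the text only says it is tied to the largest learned weight.

```python
def rerank_score(found_by_text, weak_outputs, weights, base_weight):
    """Base text score (1 if the shot was found by text retrieval, else 0)
    times its weight, plus the weighted sum of weak-hypothesis outputs,
    each of which lies in [-1, 1]."""
    base = 1.0 if found_by_text else 0.0
    return base_weight * base + sum(w * h for w, h in zip(weights, weak_outputs))


# Hypothetical example with two weak hypotheses (say, "outdoors" and "anchor").
weights = [0.5, 0.2]
base_weight = max(weights)
shots = {
    "shot_a": (True, [1.0, -1.0]),    # found by text, outdoors, not an anchor
    "shot_b": (True, [-1.0, 1.0]),    # found by text, indoor anchor shot
}
ranking = sorted(shots,
                 key=lambda s: rerank_score(*shots[s], weights, base_weight),
                 reverse=True)
```

Both shots receive the same base contribution, so their relative order is decided entirely by the weak hypotheses.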



3 Co-retrieval: Boosted Reranking for Video Retrieval 

In this section, we propose the Co-Retrieval algorithm which combines a 
boosting-type learning approach and a noisy label prediction scheme to
estimate the weights λ without any external training data or user feedback.



3.1 Boosted Reranking with Training Data 

Let us begin by considering the case when training data {x_i, y_i}
are available for each query, where y_i ∈ {−1, +1}, i = 1..N. The goal of a
learning algorithm is to find a setting for λ which leads to better retrieval outputs
than F_0. Assuming all of the x_i can be found by text retrieval, i.e. F_0(x_i) = 1, we
only need to optimize F(x_i, λ) = Σ_{t=1}^{m} λ_t h_t(x_i). In the current setting, the
learning algorithm will produce a function F : X → R which approximates the
ordering encoded by the training data {x_i, y_i}. Formally, the boosting algorithm
can be designed to minimize a loss function related to ranking misordering, i.e.
λ = arg min_λ Loss({y_i, F(x_i, λ)}). In analogy to classification problems, the loss
function can be set to a monotonically decreasing function of the margin y_i F(x_i, λ),
i.e. Σ_{i=1}^{N} L(y_i F(x_i, λ)). Two typical choices for function L are the exponential
loss exp(−x) used by AdaBoost and the logit loss log(1 + exp(−x)) used by
logistic regression. However, it has been argued that it is more reasonable to
optimize ranking misordering through relative preferences rather than using an
absolute labeling. Along this direction, the RankBoost algorithm proposed by
Freund et al. [2] develops a boosting framework with ranking loss for combining
multiple weak ranking functions. As a special case when the feedback function
is bipartite, they provide an efficient implementation which actually minimizes
the following ranking loss function, i.e.
Σ_{y_i=+1} Σ_{y_j=−1} exp(F(x_j, λ) − F(x_i, λ)).
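
The three losses can be written down compactly. The sketch below assumes labels in {−1, +1} and real-valued scores F(x_i, λ); for the ranking loss it assumes the standard bipartite RankBoost form, a sum of exp(F(x_j) − F(x_i)) over all pairs of a positive x_i and a negative x_j.

```python
import math


def exp_loss(labels, scores):
    """Exponential loss: sum of exp(-y_i * F(x_i)), as used by AdaBoost."""
    return sum(math.exp(-y * f) for y, f in zip(labels, scores))


def logit_loss(labels, scores):
    """Logit loss: sum of log(1 + exp(-y_i * F(x_i)))."""
    return sum(math.log(1.0 + math.exp(-y * f)) for y, f in zip(labels, scores))


def ranking_loss(labels, scores):
    """Bipartite ranking loss over all (positive, negative) pairs."""
    pos = [f for y, f in zip(labels, scores) if y == +1]
    neg = [f for y, f in zip(labels, scores) if y == -1]
    return sum(math.exp(fn - fp) for fp in pos for fn in neg)
```

The first two losses depend on the absolute score of each example, while the ranking loss depends only on score differences between positives and negatives, which is the "relative preference" view mentioned above.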

To handle different loss functions, we apply a unified boosting type algorithm
for learning the combination of weak hypotheses. The boosting algorithm shown
in Figure 3 is modified from the parallel update algorithm proposed by Collins
et al. [3]. For each round k, the algorithm first updates the distribution q_k in
a manner that increases the weights of examples which are misclassified. We
will consider three loss functions in this step, i.e. exponential loss, logit loss and
ranking loss. They have the following update rules respectively,

Input: Matrix M ∈ [−1,1]^{m×n} where M_ij = y_i h_j(x_i) and Σ_j |M_ij| ≤ 1 for all i
       N+ is the number of positive examples and N− is the number of negative examples
Output: F(x_i, λ) = Σ_{t=1}^{m} λ_t h_t(x_i), where λ_1, ..., λ_m optimize Loss({y, F(x)}).

Algorithm:
Let λ_1 = 0, q_0 = (0.5, 0.5, ...)
For k = 1, 2, ...
  1. Compute distribution q_k given M, δ_k and q_{k−1}
  2. For every positive example x_i, balance the distribution q_{k,i} = N− q_{k,i} / N+
  3. For j = 1, ..., m:
       W+_{k,j} = Σ_{i: sign(M_ij)=+1} q_{k,i} |M_ij|
       W−_{k,j} = Σ_{i: sign(M_ij)=−1} q_{k,i} |M_ij|
       δ_{k,j} = (1/2) log(W+_{k,j} / W−_{k,j})
  4. Update parameter: λ_{k+1} = λ_k + δ_k

Fig. 3. A unified boosting type algorithm with parallel-update optimization



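$$
q_{k+1,i} =
\begin{cases}
q_{k,i} \exp\left(-\sum_{j} \delta_{k,j} M_{ij}\right) & \text{(exponential loss)} \\[6pt]
\dfrac{q_{k,i}}{(1 - q_{k,i}) \exp\left(\sum_{j} \delta_{k,j} M_{ij}\right) + q_{k,i}} & \text{(logit loss)} \\[6pt]
q_{k,i} \exp\left(-\sum_{j} \delta_{k,j} M_{ij}\right) \sum_{l:\, y_l \neq y_i} q_{k,l} \exp\left(-\sum_{j} \delta_{k,j} M_{lj}\right) & \text{(ranking loss)}
\end{cases}
\qquad (2)
$$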



A new step, i.e. step 2, is added to balance the distribution between positive and
negative examples. In fact many boosting algorithms, such as RankBoost.B [2],
take some reweighting or filtering approach to obtain a balanced distribution;
otherwise a constant -1 hypothesis is the likely output from all weak hypotheses.
Finally the update vector is computed as shown in step 3 and added to the
parameters λ_k. More details and the convergence proof can be found in [3].
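
The loop of Figure 3 can be sketched for the exponential loss as follows. This is an illustration under stated assumptions, not the authors' implementation: the distribution is recomputed from the current weights in every round rather than updated recursively, the balancing factor is applied once per round, and a small constant `eps` (a hypothetical addition) keeps the logarithm finite when one of the two sums is zero.

```python
import math


def boosted_weights(M, labels, rounds=100, eps=1e-9):
    """Parallel-update boosting with exponential loss.

    M[i][j] = y_i * h_j(x_i), with sum_j |M[i][j]| <= 1 for every row i.
    labels[i] is +1 or -1. Returns one weight per weak hypothesis.
    """
    n_hyp = len(M[0])
    n_pos = sum(1 for y in labels if y == +1)
    n_neg = len(labels) - n_pos
    lam = [0.0] * n_hyp
    for _ in range(rounds):
        # Step 1: example weights under exponential loss.
        q = [math.exp(-sum(l * m for l, m in zip(lam, row))) for row in M]
        # Step 2: rescale positives so both classes carry comparable weight.
        q = [qi * n_neg / n_pos if y == +1 else qi for qi, y in zip(q, labels)]
        # Step 3: per-hypothesis update from correctly / wrongly handled mass.
        delta = []
        for j in range(n_hyp):
            w_plus = sum(qi * abs(row[j]) for qi, row in zip(q, M) if row[j] > 0)
            w_minus = sum(qi * abs(row[j]) for qi, row in zip(q, M) if row[j] < 0)
            delta.append(0.5 * math.log((w_plus + eps) / (w_minus + eps)))
        # Step 4: all weights are updated in parallel.
        lam = [l + d for l, d in zip(lam, delta)]
    return lam
```

A hypothesis that agrees with the labels on most of the weighted examples ends up with a positive weight, and one that mostly disagrees ends up with a negative weight.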

The boosting algorithm requires access to a set of weak hypotheses h(·)
produced from ranking features. The most obvious choice for weak hypotheses
is the normalized ranking features f_i, i.e., h(x) = a f_i(x) + b where a, b
are constants to normalize h(x) to the range of [−1, +1]. If we only consider the
relative ranking provided by the weak features instead of their absolute values,
we can use {-1,1}-valued weak hypotheses h of the form h(x) = 1 if
f_i(x) > θ, otherwise h(x) = −1, where θ ∈ R is some predefined threshold. In
our implementation, we use the first definition for the features which compute the
distance between given image/audio examples and video shots in the collection,
e.g. the Euclidean distance from a color histogram. For the features generated
by semantic detectors such as face and outdoors detectors, we choose the second
definition because their relative ordering makes more sense than an absolute
value. Rather than learning the threshold θ automatically, we prefer to fix the
threshold to 0.5 in terms of posterior probability.

Table 1. The number of relevant shots r_k in top k/n% shots returned by text retrieval,
averaged over 25 TREC03 queries. The number in () is (r_k/r_n) * 100%

Top shots k/n%   15%            25%           50%           75%           100%
r_k              184 (32.11%)   248 (43.8%)   367 (64.1%)   487 (84.7%)   573 (100%)
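
The two forms of weak hypothesis described in Section 3.1 can be sketched as below. The function names are illustrative; the constants a and b are derived here from the minimum and maximum of the observed feature values, which is one possible normalization and an assumption of this sketch.

```python
def normalized_hypothesis(values):
    """h(x) = a * f(x) + b, with a and b chosen so that the smallest
    feature value maps to -1 and the largest to +1."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant feature carries no ranking signal
        return [0.0 for _ in values]
    a = 2.0 / (hi - lo)
    b = -1.0 - a * lo
    return [a * v + b for v in values]


def threshold_hypothesis(posteriors, theta=0.5):
    """h(x) = +1 if the detector output exceeds theta, otherwise -1."""
    return [1 if p > theta else -1 for p in posteriors]
```

For a distance-based feature, where a smaller value means a closer match, the sign of `a` would have to be flipped so that closer shots receive larger outputs.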



3.2 Learning with Noisy Labels 

So far we have assumed that training data are available for the boosting algorithm.
However, collecting training data for every possible query topic on the fly is not
feasible in general. Alternative approaches have to be developed to generate a
reasonable weight assignment without requiring a large human effort to collect
training data. Formally, considering the entire set of shots returned by text retrieval
{x_1, ..., x_n}, we need to assign a set of (noisy) labels y_i which allows the
boosting algorithm to improve the retrieval performance.

Without loss of generality, let us assume {x_1, ..., x_n} are sorted in descending
order of text retrieval scores and denote by r_k the number of relevant shots in
{x_1, ..., x_k}. By analyzing the characteristics of text retrieval, we make the
following assumption in the rest of this paper: the proportion of relevant shots in
{x_1, ..., x_k} is higher than in the entire set, i.e. r_k/k > r_n/n. In other words, the
relevant shots are more likely to be higher ranked in the text retrieval. One partial
explanation is that shots farther away from the content keyword location are
lower ranked, with a lower probability of representing relevant concepts. Table
1 provides more insights to support our assumption. Therefore we can simply
assign {y_1, ..., y_k} as +1 and {y_{k+1}, ..., y_N} as -1. In practice, we can augment the
raw text retrieval scores with some highly accurate features to improve noisy
label prediction, e.g. use anchor detectors to filter out irrelevant shots.
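
The labeling rule amounts to a cut-off on the text ranking. In the sketch below the function name and the rounding of the cut-off are assumptions; the fraction is left as a parameter.

```python
def noisy_labels(text_scores, top_fraction=0.25):
    """Label the top-scoring fraction of shots +1 and all others -1.

    text_scores: text retrieval score of each returned shot.
    At least one shot is always labeled positive.
    """
    order = sorted(range(len(text_scores)),
                   key=lambda i: text_scores[i], reverse=True)
    k = max(1, int(round(top_fraction * len(text_scores))))
    labels = [-1] * len(text_scores)
    for i in order[:k]:
        labels[i] = +1
    return labels
```

The labels are returned in the original shot order, so they can be paired directly with the weak-hypothesis outputs of the same shots.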

However, because automatically generated training data is quite noisy,
regularization is generally required to reduce the effect of overfitting. Instead of
introducing a penalty function into the loss function, we suggest two types of
regularization approaches: (1) use a χ² test to select features with confidence
interval 0.1; (2) set λ_t to 0 if λ_t < 0 for the nearest-neighbor-type features.

3.3 Related Work 

Our approach builds on previous work which investigated the use of learning 
algorithms to improve ranking or retrieval. Collins et al. [4] considered a similar 
discriminative reranking approach to improve upon the initial ranking for natural 
language parsing. Tieu et al. [5] used boosting to choose a small number of 
features from millions of highly selective features for image retrieval. Blum et 
al. [6] proposed the co-training algorithm which trains a noise-tolerant learning 
algorithm using the noisy labels provided by another classifier. 

In [7] we described a related co-retrieval approach which also attempted 
to learn the linear weights of different modalities with noisy label prediction. 
However, the current work represents several improvements over the previous
algorithm: (1) The proposed algorithm is more efficient because it only trains on
the top video clips provided by text retrieval instead of the whole collection; (2)
It applies a unified boosting algorithm to select the most useful weak features
with different loss functions; (3) An additional regularization step is added to
avoid overfitting; (4) The positive and negative distributions are balanced before
training; (5) It converts ranking features into further weak hypotheses.

4 Experiments 

Our experiments followed the guidelines for the manual search task in the 2003
NIST TRECVID Video Track [1], which require an automatic system to
search without human feedback for video shots relevant to 25 query topics in a
65-hour news video collection. The retrieval units were video shots defined by a
common shot boundary reference. The evaluation results are reported in terms
of the mean average precision (MAP) and precision at top N retrieved shots. We
generated 7 weak ranking features in our experiments including 4 types of general 
semantic features (face, anchor, commercial, outdoors), and 3 types of image- 
based features generated by the Euclidean distance of color, texture and edge 
histograms when query image examples were available. Detailed descriptions on 
the feature generation can be found in [7]. 
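
For reference, the two evaluation measures can be computed as in the sketch below. It assumes a binary relevance list in ranked order; by default average precision is normalized by the number of relevant shots that were retrieved, and `total_relevant` (a hypothetical parameter) can be supplied to normalize by the number of relevant shots in the whole collection instead.

```python
def precision_at(relevance, n):
    """Fraction of relevant shots among the top n retrieved shots."""
    return sum(relevance[:n]) / n


def average_precision(relevance, total_relevant=None):
    """Mean of the precision values measured at each relevant shot."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    denom = total_relevant if total_relevant is not None else hits
    return sum(precisions) / denom if denom else 0.0


def mean_average_precision(relevance_lists):
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r) for r in relevance_lists) / len(relevance_lists)
```

Average precision rewards placing relevant shots early in the list, which is why reranking the top of a text result list can raise MAP even when the set of retrieved shots is unchanged.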

The following experiments consider two typical scenarios for video retrieval: 
1. when only keywords are provided we use only semantic ranking features; 2. 
when both keywords and image examples are provided we additionally use image 
ranking features. The co-retrieval algorithm works as follows: first return at most
400 video shots using text retrieval as a base ranking function, label the top a% of
shots as positive and the others as negative², learn the parameter λ based on the
noisy labels and feed this back to the reranking model. We set the number of rounds
T to be 10000 and choose the best round using cross validation. λ_0 is set to
n · max_t |λ_t|, where n is the number of weak hypotheses.

Figure 4 shows the performance improvement of Co-Retrieval without/with
image examples over text retrieval alone. This improvement is achieved by
successful reranking of top video shots. Table 2 lists a more detailed comparison for
various retrieval approaches over mean average precision (MAP) at top 400 shots
and precision at 10, 30 and 100 shots. Filtering out the anchor and commercial
shots from text retrieval (Text/A/C) brings a slight performance improvement
over text retrieval (Text). In contrast, Co-Retrieval with all three different loss
functions (CoRet+ExpLoss, LogLoss, RankLoss) achieves a considerable
and similar improvement over text retrieval in terms of all performance measures;
in particular, when image examples are available, MAP increases by about
5 percentage points (from 0.157 to roughly 0.207). To investigate how noisy labels
affect the results, we report the results of Co-Retrieval learning with truth
labels (CoRet+Truth), which gives another 1.4 percentage point increase
in MAP. This shows that the proposed algorithm is not greatly affected by the

2 We augment noisy label prediction by reweighting shots identified as anchors or
commercials from text retrieval scores. a% is simply set to 25%, because our experiments
show that retrieval performance is not very sensitive to the choice of a%.

Fig. 4. The key frames of top 8 retrieved shots for query “Finding Tomb at Arlington
National Cemetery”. (a) Retrieval on text features (b) Co-Retrieval w/o image
examples (c) Co-Retrieval with image examples

Table 2. Comparison between various retrieval approaches. See text for details 





Search w. Examples 


Search w/o Examples 


Approaches 


MAP 


PreclO 


Prec30 


PreclOO 


MAP 


PreclO 


Prec30 


PreclOO 


Text 


0.157 


0.292 


0.225 


0.137 


0.157 


0.292 


0.225 


0.137 


Text /A/C 


0.158 


0.304 


0.236 


0.146 


0.158 


0.304 


0.236 


0.146 


Global Oracle 


0.188 


0.368 


0.259 


0.16 


0.164 


0.336 


0.235 


0.152 


CoRet+ExpLoss 


0.206 


0.444 


0.307 


0.171 


0.177 


0.352 


0.261 


0.156 


CoRet+LogLoss 


0.208 


0.432 


0.3 


0.172 


0.178 


0.344 


0.263 


0.156 


CoRet +RankLoss 


0.207 


0.448 


0.301 


0.172 


0.178 


0.344 


0.26 


0.156 


CoRet+Truth 


0.222 


0.436 


0.325 


0.19 


0.189 


0.384 


0.28 


0.171 


Local Oracle 


0.285 


0.512 


0.344 


0.199 


0.212 


0.436 


0.304 


0.171 



overfitting problem typical with noisy labels. We also report the results of two 
oracles using the algorithms presented in [8]: An oracle of the single best com- 
bination weight for all queries (Global Oracle) and an oracle for the optimal 
combination weights per query (Local Oracle), which assumes all relevant shots 
are known ahead of time. This analysis shows that the Co- Retrieval consistently 
performs better than the theoretical optimal fixed-weight combination. 
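For readers unfamiliar with the measures reported in Table 2, the sketch below computes precision at k and non-interpolated average precision for one ranked list; MAP is the mean of the latter over all queries. The function names and the toy ranking are illustrative, and the exact conventions of the official TRECVID evaluation tools are not reproduced here.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of relevant shots among the first k results (0/1 judgements)."""
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[int], total_relevant: int) -> float:
    """Non-interpolated average precision over a ranked list of 0/1 judgements."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            # Precision measured at the rank of each relevant shot.
            precision_sum += hits / rank
    return precision_sum / total_relevant if total_relevant else 0.0


# Toy ranking: relevant shots at ranks 1 and 3, two relevant shots in total.
ranking = [1, 0, 1, 0, 0]
ap = average_precision(ranking, total_relevant=2)
p_at_3 = precision_at_k(ranking, 3)
```

In this toy case the average precision is (1/1 + 2/3) / 2, roughly 0.83, and precision at 3 is 2/3.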

5 Discussions 

Why not optimize the performance criterion directly, that is mean average pre- 
cision? Table 2 shows that there is a considerable performance gap between 
the local oracle and Co-Retrieval even with true labels. Therefore it is of inter- 
est to ask if we can optimize the performance criterion directly. However these 
performance criteria are usually not differentiable and not convex, which leads 
to several problems such as local maxima, inefficiency and poor generalization. 
Table 3(a) shows that maximizing mean average precision on the noisy labels
does not generalize well enough to boost the true mean average precision: MAP
drops from 0.206 (ExpLoss) to 0.182.

Table 3. Comparison between various retrieval approaches when image examples are
available. (a) Co-Retrieval maximizing ExpLoss vs. MAP; (b) Co-Retrieval with
regularization (Reg), without regularization (NoReg) and with automatically
learned weak hypotheses (More h)

(a)
           MAP    Prec10  Prec30  Prec100
ExpLoss    0.206  0.444   0.307   0.171
MAP        0.182  0.376   0.265   0.16

(b)
           MAP    Prec10  Prec30  Prec100
Reg        0.206  0.444   0.307   0.171
NoReg      0.192  0.404   0.293   0.171
More h     0.171  0.34    0.236   0.142

Why is boosting not overfitting? It is well known that boosting-type algorithms
are not robust to noisy data and exhibit suboptimal generalization ability in
the presence of noise, because they concentrate more and more on the noisy data
in each iteration [9]. However, our boosting algorithm does not seem to be
affected by the overfitting problem even though our training data contains a lot
of noise. Two answers come to mind. First, the regularization step, which
intentionally puts constraints on the choice of parameters, improves
generalizability. Table 3(b) compares the performance with and without the
regularization mentioned in Section 3.2 and shows that MAP decreases by about
1.4% without it. Secondly, the version space of our weak hypotheses is much
smaller than in most previous work such as [2], because we choose to fix the
thresholds for weak hypotheses instead of learning these thresholds
automatically. Table 3(b) shows that performance is much worse when the
thresholds are learned. To explain this, we utilize a theoretical analysis of
boosting algorithms by Schapire et al. [9]. They claim that a bound on the
generalization error $P_{z \sim D}[\rho(z) \le 0]$ depends on the VC-dimension
$d$ of the base hypothesis class and on the margin distribution of the training
set. With probability at least $1 - \delta$, it satisfies

$$P_{z \sim D}[\rho(z) \le 0] \le P_{z \sim Z}[\rho(z) \le \theta] + O\left(\frac{1}{\sqrt{l}}\left(\frac{d \log^2(l/d)}{\theta^2} + \log(1/\delta)\right)^{1/2}\right),$$

where $Z$ is the training sample of size $l$ and $\theta > 0$ is a margin threshold.

This analysis supports our observation that the generalization error increases
when the VC-dimension d becomes higher or, equivalently, when the hypothesis
space becomes larger. Learning flexible thresholds allows the algorithm to
achieve lower empirical error for noise-free labels; in the highly noisy case,
however, reducing the hypothesis space turns out to be a better choice for
learning.
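To make the dependence on d concrete, the sketch below evaluates the complexity term of the bound quoted above. It assumes the standard form of that term, sqrt((d log^2(l/d) / theta^2 + log(1/delta)) / l), with l training examples and margin threshold theta, and it ignores the constant hidden in the O(.) notation. The numeric values are purely illustrative and are not taken from the experiments.

```python
import math


def margin_bound_complexity(d: int, l: int, theta: float, delta: float) -> float:
    """Complexity term of the margin bound, without the O(.) constant.

    d: VC-dimension of the base hypothesis class
    l: number of training examples
    theta: margin threshold (> 0)
    delta: confidence parameter (bound holds with probability >= 1 - delta)
    """
    inner = d * math.log(l / d) ** 2 / theta ** 2 + math.log(1.0 / delta)
    return math.sqrt(inner / l)


# Illustrative values only: 400 training shots, margin 0.5, 95% confidence.
small = margin_bound_complexity(d=5, l=400, theta=0.5, delta=0.05)
large = margin_bound_complexity(d=50, l=400, theta=0.5, delta=0.05)
```

For these values the term grows when d increases from 5 to 50, which matches the qualitative argument that a larger hypothesis space loosens the bound when the margin distribution stays the same.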

6 Conclusions 

This paper presents a reranking framework for video retrieval to augment
retrieval based on text features. We also propose a boosted reranking algorithm
called Co-Retrieval, which applies a boosting-type algorithm to automatically
select the most useful weak hypotheses for different queries. Our experiments on
the TRECVID 2003 search task demonstrate the effectiveness of the proposed
algorithms whether or not image examples are available. Finally, we discuss two
issues of Co-Retrieval: the choice of loss functions and the overfitting problem
of boosting. As a possible extension, we can consider adding a relevance
feedback function to the Co-Retrieval algorithm, which allows the interactive
search system to rerank the current retrieval output given users’ relevance
feedback.



References 

1. TREC Video Track, http://www-nlpir.nist.gov/projects/tv2003/tv2003.html.

2. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, “An efficient boosting algorithm
for combining preferences,” in Proc. of ICML-98, 1998, pp. 170-178.

3. M. Collins, R. E. Schapire, and Y. Singer, “Logistic regression, AdaBoost and
Bregman distances,” in COLT, 2000, pp. 158-169.

4. M. Collins, “Discriminative reranking for natural language parsing,” in Proc. 17th
Intl. Conf. on Machine Learning, 2000, pp. 175-182.

5. K. Tieu and P. Viola, “Boosting image retrieval,” in Intl. Conf. on Computer
Vision, 2001, pp. 228-235.

6. A. Blum and T. Mitchell, “Combining labeled and unlabeled data with co-training,”
in COLT, 1998.

7. A. G. Hauptmann et al., “Informedia at TRECVID 2003: Analyzing and searching
broadcast news video,” in Proc. of (VIDEO) TREC 2003, Gaithersburg, MD, 2003.

8. R. Yan and A. G. Hauptmann, “The combination limit of multimedia retrieval,” in
Proc. of ACM Multimedia-03, 2003.

9. G. Rätsch, T. Onoda, and K.-R. Müller, “Soft margins for AdaBoost,” Machine
Learning, vol. 42, no. 3, pp. 287-320, Mar. 2001.




HMM Model Selection Issues for Soccer Video 



Mark Baillie, Joemon M. Jose, and Cornelis J. van Rijsbergen 



Department of Computing Science, University of Glasgow, 
17 Lilybank Gardens, Glasgow, G12 8QQ, UK 
{bailliem, jj, keith}@dcs.gla.ac.uk



Abstract. There has been a concerted effort from the Video Retrieval
community to develop tools that automate the annotation process of
Sports video. In this paper, we provide an in-depth investigation into
three Hidden Markov Model (HMM) selection approaches, since HMM, a
popular indexing framework, is often applied in an ad hoc manner. We
investigate what effect, if any, poor HMM selection can have on future
indexing performance when classifying specific audio content. Audio is
a rich source of information that can provide an effective alternative
to high dimensional visual or motion based features. As a case study,
we also illustrate how a superior HMM framework, optimised using a
Bayesian HMM selection strategy, can both segment and then classify
Soccer video, yielding promising results.



1 Introduction 

Live televised sporting events are now commonplace, especially with the arrival
of dedicated digital channels. As a result, the volume of Sports video produced
and broadcast has increased considerably over recent years. Where such data
is required to be archived for reuse, automated indexing [2, 3, 5, 8, 9] is a viable
alternative to the labour-intensive manual procedures currently in practice. To
date, feasible solutions have not been developed. Current advancements, mainly
the automatic identification of low level semantic structures, such as shot
boundaries [3], semantic units [5, 9] and genre classification [8], can reduce both
the time and workload for manual annotation. Also, recognition of such low level
structure is the basis on which further processing and indexing techniques can be
developed. For example, labelling of low level segments can enable domain
specific indexing tools such as exciting event detection [2] to be enhanced,
utilising prior knowledge of content.

The difficulty with indexing Soccer video is that unrelated semantic
components can contain visually very similar information, resulting in accuracy
problems. For example, it is not uncommon for advertisements to display Sport
sequences during televised events to boost the marketing appeal of a product,
which is a potential source of error. However, audio is a rich, low dimensional
alternative to visual information that can provide an effective solution to this
problem.

In this paper we model audio content using the Hidden Markov Model 
(HMM), a popular indexing framework. The main thrust of this research is
to provide an in-depth investigation into HMM model selection, where HMM is
largely applied in an ad hoc manner for video content indexing [5,8,9]. We also 
investigate what effect poor selection can have on future indexing accuracy. 

The remainder of this paper is structured as follows. In Section 2, we identify 
the potential factors that influence the application of a HMM. We then formally 
investigate three model selection strategies, in Section 3. As a case study, in 
Section 4, we illustrate how an extended HMM framework for segmentation and 
classification of Soccer video, can be optimised using model selection, yielding 
promising results. Finally, we conclude our work in Section 5. 



2 Hidden Markov Model Issues 

HMM is an effective tool for modelling time varying processes, belonging to a
family of probabilistic graphical models able to capture the dynamic properties
of temporal data [7]. Similar static representations, such as the Gaussian Mixture
Model (GMM), do not model the temporal properties of audio data, hence the
popularity of HMM in the fields of Speech Recognition [4, 7], temporal data
clustering [6, 7] and more recently Video Retrieval [2, 3, 5, 8, 9]. An important issue
when employing a continuous density HMM framework is model selection [4, 6, 7].
For example, a crucial decision is the selection of both an appropriate number
of hidden states and of (Gaussian) mixture densities per state. Accurate
segmentation and classification depend on optimal selection of both these
parameters. An insufficient number of hidden states will not capture enough
detail, such as data structure, variability and common noise, thus losing vital
information required for discrimination between groups. A greater number of
hidden states would encapsulate more content, though precise and consistent
parameter estimation is often limited by the size and quality of the training data.
As the number of parameters increases, so does the number of training samples
required for accurate estimation. Larger, more enriched models require a greater
volume of training data for precise parameter estimation. A further problem
with complex models is overfitting. HMMs, specifically designed to discriminate
between content, can become too detailed and begin to mirror nuances found in
unrelated groups, deteriorating classification accuracy.

HMM application for Video Retrieval has so far been ad hoc, with little
investigation into model selection and the potential side effects on system
performance. In the literature, a common theme is to apply domain knowledge or
intuition for HMM model selection. Such applications include shot boundary
detection [3], news video segmentation and classification [5], TV genre labelling [8]
and ‘Play’ or ‘Break’ segmentation [9] for Soccer video. This strategy can be
helpful when matching a known number of potential states found in the data,
such as shot segmentation [3]. However, there has been little research into how
suitable this strategy is when applied to broad content classes found in video.
For example, Wang et al. [8] employ the same number of hidden Markov states
for modelling entire video genres such as Sport and News, ignoring differences in
the underlying structure found in each separate domain.

Eickeler et al. [5] apply domain knowledge to News Broadcasts, building a
superior HMM based on a preconceived topology. Each state of a superior HMM
is represented by a simple HMM that models a broad content class found in News
video. However, there is no investigation into model selection for these simple
HMMs. Xie et al. [9] segment and classify ‘Play’ and ‘Break’ segments for Soccer
video by using HMMs to model motion and colour distribution statistics. ‘Play’
segments correspond to camera shots that track the flow of the game. To model
both segments, the authors use a series of simple HMM models with a varying
number of hidden states. For segmentation and classification, the output from
each model is then combined using a dynamic programming (DP) algorithm,
itself a first order Markov process. In fact, this application ignores the temporal
properties of the HMM, suggesting a simpler classifier such as the Gaussian
Mixture Model, applied in conjunction with the DP algorithm, may be as effective.

3 HMM Model Selection 

The main goal of model selection is to choose the simplest possible model without
a deterioration in performance. This is especially important given the difficulty
and practicality of generating large, varied training sets. In this Section, we
investigate three model selection strategies and what effect each has on
classification performance. The three selection strategies are: an exhaustive search
approach, the Bayesian Information Criterion (BIC) [4, 6] and the Akaike Information
Criterion (AIC) [1] (formulae can be found in the references). Exhaustive search, a
simple linear search algorithm, involves training and testing a series of HMMs,
where the parameter in question is iteratively increased until a stopping
threshold is reached. For each iteration, the predictive likelihood of a HMM generating
a test sample, also known as the out-of-sample log-likelihood, is calculated. Using
a stopping criterion on the predictive likelihood score is important. For example,
increasing the number of states will also increase the predictive likelihood until
each training sample is eventually modelled by its own unique hidden state.
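The exhaustive search can be sketched as a simple loop. The form of the stopping rule used here (stop when the gain in predictive log-likelihood falls below a fixed `min_gain`) is an assumption made for illustration, as are the function name and the toy scores; in practice `predictive_loglik` would train an HMM with the given number of states and score it on held-out data.

```python
from typing import Callable, Sequence


def exhaustive_search(candidates: Sequence[int],
                      predictive_loglik: Callable[[int], float],
                      min_gain: float) -> int:
    """Increase the parameter until the likelihood gain drops below min_gain."""
    best = candidates[0]
    best_score = predictive_loglik(best)
    for value in candidates[1:]:
        score = predictive_loglik(value)
        if score - best_score < min_gain:
            # Adding more complexity no longer pays off: stop here.
            break
        best, best_score = value, score
    return best


# Made-up out-of-sample log-likelihoods for 1 to 6 hidden states.
toy_scores = {1: -900.0, 2: -700.0, 3: -620.0, 4: -600.0, 5: -598.0, 6: -597.5}
chosen = exhaustive_search(sorted(toy_scores), toy_scores.__getitem__, min_gain=5.0)
```

With these toy scores the search stops when moving from 4 to 5 states, since the gain of 2.0 is below the threshold of 5.0, and returns 4.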

The two remaining strategies are BIC and AIC, both popular in the Statistical
literature. Each strategy penalises the predictive likelihood with a term that
is derived from the number of parameters in the model. The major difference
between the approaches is the derivation of this penalty term. The penalty term
for AIC only accounts for the number of free parameters in the HMM, while the
BIC penalty term also factors in the amount of training data available. Smaller
training samples will generate larger penalty scores, hence the advantage in
predictive likelihood found with more complex models is eventually outweighed by
this penalty term. We then assume the optimal model is found at the maximum
penalised predictive likelihood score, avoiding the need to threshold.
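As an illustration of the two penalties, the sketch below uses the common textbook forms in which both scores are maximised: AIC = log L - k and BIC = log L - (k/2) log n, where k is the number of free parameters and n the number of training samples. Whether these match the exact formulae in the cited references is an assumption. The parameter count assumes an ergodic HMM with diagonal-covariance Gaussian mixtures, and the function names are hypothetical.

```python
import math


def hmm_free_parameters(n_states: int, n_mix: int, dim: int) -> int:
    """Free parameters of an ergodic HMM with diagonal-covariance mixtures."""
    initial = n_states - 1                      # initial state distribution
    transitions = n_states * (n_states - 1)     # each row sums to one
    weights = n_states * (n_mix - 1)            # mixture weights per state
    means_and_vars = n_states * n_mix * 2 * dim # one mean and one variance per dimension
    return initial + transitions + weights + means_and_vars


def aic_score(loglik: float, k: int) -> float:
    """Penalised log-likelihood, AIC form (higher is better)."""
    return loglik - k


def bic_score(loglik: float, k: int, n_samples: int) -> float:
    """Penalised log-likelihood, BIC form (higher is better)."""
    return loglik - 0.5 * k * math.log(n_samples)


# 9 states, one Gaussian per state, 15 features (14 MFCC plus log energy).
k9 = hmm_free_parameters(n_states=9, n_mix=1, dim=15)
```

In this form the BIC penalty grows only as log n while the log-likelihood grows roughly linearly in n, so relative to the likelihood the penalty weighs most heavily when the training sample is small.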

3.1 Data Set 

To evaluate each strategy, we generated a data set of 12 games, each between
2 and 3 hours in length. We manually labelled the audio into three main semantic
content classes found in Soccer video: ‘Game’, ‘Studio’ and ‘Advert’ (advertisement).
‘Studio’ segments contain an introduction plus pre and post match discussion
and analysis of the live game, usually set inside a controlled soundproof studio.
‘Game’ segments consist of the live match, where the soundtrack contains
a mixture of both commentary and vocal crowd reaction, alongside other
environmental sound such as whistles, drums and clapping. ‘Advert’ segments can
be identified by the almost chaotic mixture of highly produced music, voice and
sound effects. Segmentation and labelling of these low level segments is beneficial,
especially for reducing indexing errors during higher level tasks. For example,
identifying the boundaries of a ‘Game’ segment is vital before event detection [2].
A decrease in precision would occur if the data was not pre-segmented and
labelled, because similar information from unrelated content, such as music or
sound effects, could then be wrongly identified as a key event.

3.2 Number of Markov States Selection 

A series of HMMs were implemented, modelling the ‘Game’, ‘Studio’ and
‘Advert’ content classes. The audio stream for each file was parameterised using
14 Mel-Frequency Cepstral coefficients (MFCC) with an additional Log Energy
measurement [7]. MFCC coefficients are specifically designed and proven to
characterise speech well. MFCC have also been shown to be both robust to noise and
useful in discriminating between speech and other sound classes [2, 4].

For each class, a series of ergodic, continuous density HMMs [7] with increas- 
ing number of states ranging from 1 to 20, were iteratively implemented. Each 
model was first generated from a training sample, then the predictive likelihood 
score was calculated on a separate test set. Both labeled data samples were gen- 
erated from the same pool of data and after one complete run, each sample was 
randomly changed. This was repeated 15 times to achieve a true reflection of the 
model generation process, limiting the effect of unusually good or bad runs. 
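The repeated-run protocol can be sketched as below. The split ratio, the seed handling and the function names are assumptions introduced for illustration; the text only states that the training and test samples were redrawn from the same pool and that the procedure was repeated 15 times.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def repeated_holdout(data: Sequence,
                     score_fn: Callable[[List, List], float],
                     runs: int = 15,
                     train_fraction: float = 0.5,
                     seed: int = 0) -> float:
    """Average a score over several random train/test splits of one data pool."""
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = list(data)
        rng.shuffle(shuffled)                   # redraw both samples each run
        cut = int(len(shuffled) * train_fraction)
        train, test = shuffled[:cut], shuffled[cut:]
        scores.append(score_fn(train, test))
    return mean(scores)


# Toy stand-in for "train an HMM, return the predictive likelihood on the test set":
# here the score is simply how close the two sample means are.
pool = list(range(20))
avg = repeated_holdout(pool, lambda tr, te: -abs(mean(tr) - mean(te)))
```

Averaging over the runs limits the influence of a single unusually favourable or unfavourable split, which is the purpose stated in the text.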

Importantly, each HMM was assigned a singular Gaussian density per hidden
state. An informal investigation using synthetic data indicated that it was
more important to identify the correct number of states first, to avoid searching
through an unnecessarily large number of hidden state and mixture combinations.
For example, 100 HMMs of different state and mixture combinations were
implemented using data generated by a 6 state HMM with 6 mixtures per state
(Figure 1(a)). Using the exhaustive search approach, increasing the number of
mixtures did not affect identification of the correct number of hidden states. As a
result, we could first identify the optimal number of hidden Markov states for each
content class, implementing a singular density function per state. Then, in a
separate step, the optimal number of Gaussian density components could be found,
reducing the number of parameter combinations to be implemented.

Using the ‘Game’ class as an example, Figure 1(b) displays the mean of the
15 initialisations for all selection strategies. For the exhaustive search approach,
the predictive likelihood increases as each new hidden Markov state is added to the
model. There is a rapid rise that levels off between 14 and 20 states, suggesting the
model was beginning to overfit the training data. A stopping threshold,
empirically determined using synthetically generated data (Figure 1(a)), was reached
when adding a 14th state. For the BIC strategy, the predictive likelihood also
increased dramatically but peaked and then tailed off. The maximum BIC score
was found at 9 states. For the AIC strategy, a similar pattern occurred, where
the maximum AIC score was found at 12 states. There was a similar trend for
the remaining two content groups. BIC selected the simplest model, followed by
AIC, then the exhaustive search method.

Fig. 1. (a) The predictive likelihood scores for HMMs with increasing state and mixture
component number. (b) A comparison of model selection strategies for hidden state
selection. Notice, both the AIC and BIC scores peak, while the predictive likelihood
score continues to increase. (c) Classification accuracy versus number of hidden Markov
states. (d) Classification accuracy versus the number of Gaussian mixture components.

We also evaluated what effect iteratively adding hidden Markov states had
on classification accuracy (Figure 1(c)). As a comparison, the simpler GMM
classifier [7], which does not model time as a property, was used as a baseline.
The mean classification accuracy gradually increased as new hidden states were
added to the HMM. After the 5th hidden state was added, the HMM began to
outperform the GMM classifier. A 12 state HMM was found to be optimal for
this content class, the same model selected using the AIC strategy. A similar
pattern emerged for the remaining content classes: an improvement in
classification accuracy over the baseline GMM was recorded once a certain number
of states had been added to the HMM.

3.3 Number of Gaussian Mixtures per Markov State 

The same implementation issues arise with the selection of mixture components 
that model the emission densities per hidden Markov state. For example, speech 
recognition systems have identified that HMMs with multiple Gaussian mixtures 
perform better than those with a singular density function [4,7]. A mixture of 
Gaussian components can model the multi-modal emission densities that repre- 
sent variation found in speech. However, selecting too many mixture components 
can result in overfitting. Thus, we repeated the previous experiment, this time 
implementing HMMs with increasing Gaussian mixture components per state. 

For each strategy, each content class was modelled with mixtures ranging
from 1 up to 10, fixing each HMM with the optimal number of hidden states
identified in the previous section. For example, for one content class, 3 HMMs
were implemented using the optimal number of states identified by each selection
strategy. To limit overfitting further, the covariance matrices were constrained
to be diagonal for each individual mixture, reducing the number of free
parameters. Each model setting was initialised 15 times, changing the data samples
randomly after a complete run. Our findings again indicated that the BIC
strategy selected the simplest model, followed by AIC. The exhaustive search strategy
again selected the more complex HMMs.

We also analysed what effect iteratively adding Gaussian mixtures per model
had on classification accuracy (Figure 1(d)). From our results, we discovered a
decrease in classification accuracy as mixtures were added to a singular density
HMM. This trend was consistent across all strategies and for all content classes.
Figure 1(d) illustrates a 9 state HMM for the ‘Game’ class as the
number of mixture components is iteratively increased. Classification accuracy
decreases until 4 mixtures are added, with a small reverse in trend afterwards. After
three mixtures, the model performance became poorer than that of the GMM.
This result was mirrored across the remaining two content classes and could be
indicative of both poor parameter estimation given increased model complexity
and overfitting. To summarise, a singular density HMM produced the best
classification accuracy when compared to the same HMM with multiple mixture
components.

3.4 Optimal HMM Model Evaluation Experiment 

In the previous section, we identified 3 optimal HMMs for each content class,
using three selection strategies. Next, these HMMs were formally compared over
a new test set, using a baseline GMM classifier for comparison. The test set was
approximately 2 hours in length, divided into 10 second observation sequences,
each labelled with its content class. For all content classes, a HMM was first
generated from the labelled data used in the previous section. The HMM was then
tested on the new sample. For each strategy, each new individual sequence was
assigned to the content class that produced the highest HMM likelihood score,
found using the Viterbi decoding algorithm [7].

Table 1. Confusion matrix. The % of correctly classified observations are marked with *.

                              Classification (%)
Correct  Game                         Studio                       Advert
Class    LIK    BIC    AIC    GMM     LIK    BIC    AIC    GMM     LIK    BIC    AIC    GMM     Total
Game     89.6*  92.7*  90.4*  90.4*   1.8    1.1    1.0    2.9     8.5    6.2    8.6    6.7     100%
Stud     4.6    5.2    5.1    2.9     89.1*  86.8*  87.6*  90.3*   6.2    8.0    7.2    6.8     100%
Advt     1.1    1.0    1.1    1.4     3.5    3.4    3.0    3.7     95.4*  95.6*  95.9*  94.9*   100%
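The decision rule itself is a maximum over per-class scores. In the sketch below each class model is replaced by a hypothetical one-dimensional Gaussian scorer so that the example runs on its own; in the experiments the score would instead be the Viterbi log-likelihood of the class HMM (or the GMM likelihood for the baseline). All names and numbers are illustrative.

```python
import math
from typing import Callable, Dict, Sequence


def classify_sequence(sequence: Sequence[float],
                      models: Dict[str, Callable[[Sequence[float]], float]]) -> str:
    """Return the class whose model gives the highest log-likelihood score."""
    return max(models, key=lambda name: models[name](sequence))


def gaussian_scorer(mu: float, sigma: float) -> Callable[[Sequence[float]], float]:
    """Stand-in for a trained class model: log-likelihood under one Gaussian."""
    def score(seq: Sequence[float]) -> float:
        return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
                   - (x - mu) ** 2 / (2 * sigma ** 2) for x in seq)
    return score


# Hypothetical class models with made-up parameters.
models = {
    "Game": gaussian_scorer(0.0, 1.0),
    "Studio": gaussian_scorer(3.0, 1.0),
    "Advert": gaussian_scorer(-3.0, 1.0),
}
label = classify_sequence([2.8, 3.1, 2.9], models)
```

The toy sequence lies close to the mean of the stand-in ‘Studio’ model, so that class receives the highest score.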

The results in Table 1 indicated no significant difference in terms of
classification accuracy across all selection strategies and across each content class.
Overall, the ‘Studio’ classifier showed the worst performance, where the
majority of false classifications were samples with speech containing background noise,
wrongly labelled as ‘Game’ or ‘Advert’. False classifications from the ‘Game’ class
again included sequences containing speech. These observations contained little
or no environmental sound associated with the ‘Game’ class, resulting in
misclassification. Samples containing music played inside the stadium, or other
peculiarities such as tannoy announcements, were also wrongly labelled into the
‘Advert’ class. These sound events reflected similar content found in the ‘Advert’
class. The ‘Advert’ HMM produced the highest classification accuracy for all
selection methods, where the majority of false classifications were labelled into the
‘Studio’ category. These errors were typically clean speech samples.

Although the BIC selection criterion chose the simplest HMMs overall,
there was no obvious detriment in performance. In fact, the HMM selected by
BIC for the ‘Game’ class produced the highest classification accuracy. However,
the same selection strategy resulted in the lowest classification accuracy
for the ‘Studio’ group. Interestingly, for the same content class the baseline GMM
classifier recorded the best result. In fact, across all content classes, the GMM
displayed comparable results when compared to the HMM.



3.5 Discussion 

From experimentation, we illustrated the importance of model selection, where a 
gain in performance can be found when selecting HMMs methodically. For many 
Video indexing applications of HMM, this type of approach is not adopted [5, 
8,9], highlighting optimisation issues for each system. Selecting too few or too 
many hidden states can produce poor classification performance, as shown from 
the experimentation of three model selection techniques. 

The BIC method selected the simplest HMMs without significantly decreasing
classification accuracy, in some cases displaying a higher classification
accuracy than more complex HMMs. Also, the BIC strategy has an obvious advantage
over an exhaustive search approach. The BIC penalty term creates a maximum
peak in the predictive likelihood score. We assume this maximum to be the optimal
HMM. Hence, to find an optimal solution, the number of HMMs required to
be implemented can be reduced by avoiding an iterative addition of parameters.
For example, a search strategy such as bisection or Newton-Raphson could be
implemented to find the maximum BIC score.
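One way to realise this idea is a binary search on the slope of the score, sketched below. This is a plain discrete bisection rather than Newton-Raphson, and it assumes the BIC score rises to a single peak and then falls; averaged scores from real data may be noisy enough to violate that assumption. The function names and the toy score are hypothetical.

```python
from typing import Callable, Dict, List


def argmax_unimodal(score: Callable[[int], float], low: int, high: int) -> int:
    """Locate the peak of a score that rises then falls over integers [low, high]."""
    cache: Dict[int, float] = {}

    def f(n: int) -> float:
        if n not in cache:
            cache[n] = score(n)   # in practice: train one HMM and compute its BIC
        return cache[n]

    while low < high:
        mid = (low + high) // 2
        if f(mid) < f(mid + 1):
            low = mid + 1         # still climbing: the peak lies to the right
        else:
            high = mid            # flat or falling: the peak is at mid or to the left
    return low


# Toy score peaking at 9 states; `calls` records which models were "trained".
calls: List[int] = []


def toy_bic(n_states: int) -> float:
    calls.append(n_states)
    return -float((n_states - 9) ** 2)


best = argmax_unimodal(toy_bic, low=1, high=20)
```

On this toy curve the peak at 9 states is found after scoring only a handful of candidate models instead of all 20.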

From experimentation, an important discovery was the effect that increasing the
number of mixture components had on classification accuracy. Adding further
Gaussian mixtures to a singular density HMM had a detrimental effect.
Increasing the complexity resulted in poor parameter estimation and overfitting. In
most cases, after two or more mixtures were added, the baseline GMM recorded
better results. In fact, for the task of classification, the HMM framework did not
perform significantly better than the GMM overall. For this problem, GMM has
been shown to be as effective as the more complex HMM.

4 A Segmentation and Classification System 

In this section, our aim is to segment and then classify Soccer video files using
audio information alone. We present a case study illustrating how HMMs
optimally selected using BIC can be integrated into a superior HMM framework [5].
This combination scheme utilises both domain knowledge and statistical
model selection, where each optimised HMM becomes a single state in a unified
HMM topology. This superior HMM allows an entire video file to be
segmented, classifying semantic segment units in a single pass. The advantage of
applying this decision process is the ability to incorporate the temporal flow of
the Video into the segmentation process, limiting error. For example, restricting
movement from the ‘Advert’ to ‘Game’ segments can be mirrored in the state
transition matrix in the superior HMM. Also, an input and an output state, marking
the beginning and end of each video file, are included.
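The effect of a restricted transition can be illustrated with a small log-domain Viterbi decoder over the three content classes. This is a simplified sketch: each superior state is reduced to a single per-second log-likelihood instead of a full class HMM, the input and output states are omitted, and all probabilities and scores are made-up values chosen for the example.

```python
import math
from typing import Dict, List

NEG_INF = float("-inf")


def safe_log(p: float) -> float:
    """Logarithm that maps a forbidden (zero probability) move to -infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(obs_loglik: List[Dict[str, float]],
            start: Dict[str, float],
            trans: Dict[str, Dict[str, float]]) -> List[str]:
    """Most likely state sequence given per-frame log-likelihoods for each state."""
    states = list(start)
    # delta[s]: best log score of any path ending in state s at the current frame.
    delta = {s: safe_log(start[s]) + obs_loglik[0][s] for s in states}
    backpointers = []
    for frame in obs_loglik[1:]:
        new_delta, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: delta[r] + safe_log(trans[r][s]))
            new_delta[s] = delta[prev] + safe_log(trans[prev][s]) + frame[s]
            pointers[s] = prev
        delta = new_delta
        backpointers.append(pointers)
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: delta[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


start = {"Studio": 0.8, "Game": 0.1, "Advert": 0.1}
trans = {
    "Studio": {"Studio": 0.8, "Game": 0.1, "Advert": 0.1},
    "Game": {"Studio": 0.1, "Game": 0.8, "Advert": 0.1},
    "Advert": {"Studio": 0.2, "Game": 0.0, "Advert": 0.8},  # Advert -> Game forbidden
}
obs = [
    {"Studio": -5.0, "Game": -6.0, "Advert": -1.0},
    {"Studio": -5.0, "Game": -6.0, "Advert": -1.0},
    {"Studio": -2.0, "Game": -3.0, "Advert": -6.0},
    {"Studio": -6.0, "Game": -1.0, "Advert": -6.0},
]
path = viterbi(obs, start, trans)
```

Because the direct move from ‘Advert’ to ‘Game’ has zero probability, the decoded path for this toy input passes through ‘Studio’ before entering ‘Game’.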

To evaluate this technique, given the limited data set, we applied ‘leave one
out cross validation’. 11 complete video files were used for model training. The
‘held out’ video was then used to evaluate the superior HMM. This procedure was
repeated, holding out each video in turn, until segmentation and classification
was achieved for all videos in the data set. We indexed all 12 video files using the
Viterbi decoding algorithm, where each one second interval is assigned to a state in the
superior HMM that represents a specific content class. An ambiguity window
of 2 seconds was allowed for each segment change when comparing the indexed
files with the manually generated truth data. This was to limit small alignment
errors between the ground truth and the model output.
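Boundary scoring with an ambiguity window can be sketched as follows. The greedy one-to-one matching used here is an assumption, since the text does not specify how detected and true boundaries were paired; the function name and the boundary times are illustrative.

```python
from typing import List, Tuple


def boundary_precision_recall(detected: List[int],
                              truth: List[int],
                              tolerance: int = 2) -> Tuple[float, float]:
    """Match each true boundary to at most one detection within `tolerance` seconds."""
    unmatched = sorted(detected)
    hits = 0
    for t in sorted(truth):
        candidates = [d for d in unmatched if abs(d - t) <= tolerance]
        if candidates:
            # Use the closest detection and remove it so it cannot match twice.
            best = min(candidates, key=lambda d: abs(d - t))
            unmatched.remove(best)
            hits += 1
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Boundary times in seconds (made-up values).
precision, recall = boundary_precision_recall(
    detected=[100, 305, 900, 1500], truth=[101, 300, 902])
```

Here the detections at 100 s and 900 s fall inside the 2 second window of a true boundary, while 305 s is 5 seconds away from the true boundary at 300 s and 1500 s has no counterpart, giving a precision of 2/4 and a recall of 2/3.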

The majority of segment boundaries were identified with 95.7% recall and
89.2% precision, and 97.9% of the segments were correctly labelled. Even allowing
for the ambiguity window, those segment changes that were not picked up
correctly were largely due to alignment errors, where the detected boundary was
missed by a few seconds. False detections of segment change mostly involved
wrongly identified segment transitions from ‘Studio’ to ‘Game’ segments or
vice versa. For example, false boundary changes were marked during a ‘Game’
segment where there was a decrease in crowd sound. A simple solution to this
problem would be to add a state duration element into the HMM framework.
One complete ‘Game’ segment spans approximately 45 minutes. Incorporating a
time distribution could avoid false classifications, especially during quiet spells
in a ‘Game’ segment.

5 Conclusions and Future Work 

In this paper, we investigated three HMM model selection strategies, examining
factors that can affect the application of a HMM framework. We found the
BIC selection strategy to be the most effective. By incorporating optimal
HMMs into a unified framework, we then illustrated how a superior HMM can
be applied to both segment and classify the low level structure of Soccer video,
yielding promising results. Labelling was achieved by modelling underlying audio
patterns found in each semantic unit.

Intended future work will include the extension of the superior HMM framework
to include visual, motion and textual information sources. Another active
area of interest will be incorporating the classification of smaller sub-groups such
as crowd cheering for event detection [2], music and speech, thus extending the
HMM framework to include a more complete topology for the annotation of
live Soccer broadcasts. Finally, we wish to compare this system against other
frameworks, a requirement highlighted during experimentation.

Tennis Video Analysis Based on Transformed Motion Vectors

Abstract. Motion Vectors (MVs) indicate the motion characteristics
between two video frames, and have been widely used in content-based
sports video analysis. Previous works on sports video analysis have
proved the effectiveness and efficiency of MV-based methods. However,
in tennis video, MV-based methods are seldom applied
because the motion represented by the MVs is greatly deformed relative to
the player’s true movement due to the camera’s diagonal shooting. In this
paper, an MV transformation algorithm is proposed to revise the deformed
MVs using a pinhole camera model. With the transformed MVs,
we generate temporal feature curves and employ Hidden Markov
Models to classify two types of player’s basic actions. Evaluation on four
hours of live tennis video shows very encouraging results.



1 Introduction 

As one of the most salient visual characteristics, motion is widely used
in content-based sports video analysis. In an MPEG stream, the Motion Vector
(MV), extracted from the compressed video stream, reflects the displacement of
a macro block, and most of the motion features currently employed in sports video
analysis are based on MVs. Duan et al. [1] give a comprehensive summary
of MV-based mid-level representations and corresponding implementations for
sports game analysis. Also based on MVs, Ma [2] calculates the motion energy
spectrum for video retrieval, and in [3] the motion energy redistribution function
is proposed for semantic event recognition. These works show that MV-based
methods are computationally efficient and perform effectively for most generic
applications.

However, MV-based methods are seldom used in tennis video analysis.
Conventional methods for tennis video analysis mainly focus on detecting
and tracking the player or ball in the image sequence [4] [5], as well as
incorporating human gesture and behavior analysis [6]. Although these computer
vision methods may provide elaborate annotation of a tennis game,
they are complicated to implement, inflexible to use and subject to nontrivial
limitations. In our investigation, the main obstacle to using
MV-based methods in tennis video is that the camera views the court diagonally
rather than perpendicular to the tennis court plane. Thus, the motion vector
estimated from the tennis video cannot correctly represent the player’s true
movement: the magnitude of the MV is reduced and the orientation of the MV is
distorted. The deformation is particularly notable for the player in the top half
court. To use MV-based methods in tennis video analysis, the MVs must
be revised according to the player’s true movement.

In this paper, an algorithm of motion vector transformation for tennis video 
analysis is proposed, in which the transformation is implemented based on the 
pinhole camera model. In this algorithm, the tennis court and net lines are de- 
tected to construct the pinhole camera model. Two types of player’s basic actions 
are classified to verify the proposed algorithm. In the classification, the temporal 
motion curves are generated and the HMM-based classifiers are built using our 
previous work [3]. Experiments on live tennis videos demonstrate the effectiveness 
and efficiency of the proposed algorithm of motion vector transformation. 

The rest of this paper is organized as follows. Section 2 presents the motion 
vector transformation by utilizing the pinhole camera model. In Section 3, the 
implementation of the transformation for classifying player’s basic actions is 
introduced. Then, experiments and discussion are given in Section 4. Finally, 
Section 5 presents the conclusion and future works. 



2 Motion Vector Transformation 

The camera in a tennis game is usually placed right above the vertical symmetry
axis of the tennis court, and thus the rectangular court is projected to an
isosceles trapezoid in the image, as shown in Fig. 1. Consequently, the player’s
movement is also distorted and the MV estimated from the video sequence cannot
represent the true movement. As illustrated in the left of Fig. 1, a motion vector
can be denoted as the displacement from a given point $p_1$ in the current frame
to its corresponding point $q_1$ in the next frame. Supposing the player is observed
moving from $p_1$ to $q_1$ in the video sequence, the true movement should be from
$p_2$ to $q_2$, as shown in the right of Fig. 1. Compared with the true motion, the
magnitude of the estimated motion in the video sequence is reduced and the
orientation is also distorted. The distortion in the vertical direction is especially
prominent, and there is always $y_1 < y_2$ in Fig. 1. Such deformation makes it
difficult to analyze the player’s true movement based on the original MVs; for
instance, we can hardly tell whether the player is taking the net or not just from
the vertical projection of the MVs. However, this task becomes feasible if the
original MVs can be revised according to the actual movements.

In fact, for points $p_1$ and $q_1$, if the corresponding points $p_2$ and $q_2$ in the
tennis court plane can be correctly located, the transformation is achieved, i.e.
the transformed motion vector is the displacement

$$\overrightarrow{p_2 q_2} = q_2 - p_2 \qquad (1)$$

Fig. 1. Illustration of the motion vector deformation in tennis video

Thus the essential problem is, for any given point in the video frame, how
to find the corresponding point in the tennis court plane. To perform this task, a
pinhole camera model is employed as illustrated in the top left of Fig. 2. For a
pinhole camera, we have

$$L/l = u/f \qquad (2)$$

where $u$ and $f$ denote the object distance and the camera focal length, and $L$
and $l$ denote the lengths of the object and the image respectively. Supposing the
horizontal distance between the camera and the bottom baseline of the true tennis
court is $d$, and the height of the camera above the ground is $h$, Eq. (2) gives

$$
\begin{cases}
W/w_1 = \sqrt{d^2 + h^2}\,/f \\
W/w_2 = \sqrt{(d + H)^2 + h^2}\,/f \\
W/w_3 = \sqrt{(d + 2H)^2 + h^2}\,/f
\end{cases} \qquad (3)
$$

Here $W$ and $H$ denote the width and half height of the true tennis court [8],
and $w_1$, $w_2$, $w_3$ respectively represent the lengths of the bottom baseline, net
line and top baseline in the image plane, as shown in Fig. 2.

For any given point $p'$ in the trapezoidal court in the image plane, the line
passing through $p'$ and parallel to the baselines is segmented by $p'$ and the
two court sidelines into two parts, whose lengths are denoted as $w_{x1}$ and $w_{x2}$
respectively. The position of $p'$ is uniquely represented by $w_{x1}$ and $w_{x2}$.
Supposing $p$ is the point in the true tennis court corresponding to $p'$, $p$ is
uniquely represented by $x_1$, $x_2$, $y$, which denote the distances between $p$ and
the two sidelines and the bottom baseline, as illustrated in Fig. 2. With the pinhole
camera model in Eq. (2), the relations between $(x_1, x_2, y)$ and $(w_{x1}, w_{x2})$ are

$$
\begin{cases}
W/(w_{x1} + w_{x2}) = \sqrt{(y + d)^2 + h^2}\,/f \\
w_{x1}/w_{x2} = x_1/x_2 \quad \text{and} \quad x_1 + x_2 = W
\end{cases} \qquad (4)
$$
Fig. 2. Pinhole camera model based transformation between the image plane and the
true tennis court plane 



With Eq. (3), the parameters of d and h can be solved, thus for a given point in 
video frame (the image plane), the position of the corresponding point in the true 
tennis court plane can be calculated with Eq. (4) . Using the point transformation 
functions, the two end points of a MV are transformed to the true tennis court 
plane, and the new motion vector is calculated by taking the difference between 
the two transformed end points, as shown in Eq. (1). 
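
As a rough illustration of this procedure, the sketch below solves Eq. (3) in closed form and then applies Eq. (4) to the two end points of an MV. It is a sketch under stated assumptions rather than the authors' implementation: the function names are hypothetical, each image point is assumed to be given directly as its pair $(w_{x1}, w_{x2})$, and $f$ is treated as a third unknown recovered together with $d$ and $h$. Squaring Eq. (3) and writing $a_i = (W/w_i)^2$ gives $f^2 = 2H^2/(a_3 - 2a_2 + a_1)$, from which $d$ and $h$ follow.

```python
import math

def solve_camera(W, H, w1, w2, w3):
    """Solve Eq. (3) for (d, h, f) from the image lengths of the
    bottom baseline (w1), net line (w2) and top baseline (w3)."""
    a1, a2, a3 = (W / w1) ** 2, (W / w2) ** 2, (W / w3) ** 2
    f_sq = 2.0 * H ** 2 / (a3 - 2.0 * a2 + a1)
    d = ((a2 - a1) * f_sq - H ** 2) / (2.0 * H)
    h = math.sqrt(max(0.0, a1 * f_sq - d ** 2))
    return d, h, math.sqrt(f_sq)

def image_to_court(W, d, h, f, wx1, wx2):
    """Apply Eq. (4): map an image point, given as the two segment
    lengths (wx1, wx2) of its row, to court coordinates (x1, y)."""
    u = f * W / (wx1 + wx2)            # camera-to-row distance
    y = math.sqrt(max(0.0, u ** 2 - h ** 2)) - d
    x1 = W * wx1 / (wx1 + wx2)
    return x1, y

def transform_mv(W, d, h, f, p_img, q_img):
    """Transformed MV: difference of the two transformed end points."""
    px, py = image_to_court(W, d, h, f, *p_img)
    qx, qy = image_to_court(W, d, h, f, *q_img)
    return qx - px, qy - py
```

The camera parameters only need to be solved once per Game Shot, after which every MV in the shot can be transformed with `transform_mv`.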

In most of the Game Shots of tennis video, the camera is usually appropri- 
ately placed and our assumption is approximately justified. With the robust line 
detection algorithm proposed in [7], the exact position of the trapezoidal court 
in tennis video, including the lengths of the borders and the coordinates of the 
corners, can be obtained through averaging the line detection results in several 
beginning frames of the Game Shots. Fig. 3 gives an example of the court line 
detection results in several consecutive video frames. The tennis court and net 
line in image are identified with the red lines. It is shown that the performance 
of the line detection algorithm is reliable in practice. 




Fig. 3. Results of tennis court line detection in consecutive video frames 



When the position information of the trapezoidal tennis court is obtained, 
all motion vectors in the Player Active Area are transformed to the true tennis 
court plane. The Player Active Area is defined as a larger isosceles trapezoid 
covering the tennis court in video frame, as the trapezoid in dash-dot line shown 
in the right bottom of Fig. 2. 

3 Classification of Player’s Basic Actions 

In order to evaluate the performance of the motion vector transformation, we 
apply it to the semantic analysis of tennis video. A tennis video generally includes
various scenes, but the most important shots are the Game Shots. In this paper,
the Game Shots have been selected from the video sequence by the color-based
approach in [4]. Two types of player’s basic actions are considered in the current
experiments: Net Game and Baseline Game. In a Net Game the player moves
into the forecourt and toward the net to hit volleys, and in a Baseline Game
the player hits the ball from near the baseline [8].

3.1 System Overview 

The system overview for classifying the player’s basic actions is illustrated in 
Fig. 4. For each Game Shot, the Original MVs are firstly extracted from the 
video stream, and then are fed into the proposed Transformation Algorithm 
to calculate the Transformed MVs. As introduced in Section 2, with the Line 
Detection algorithm, the position information of Tennis Court is identified to 
build up the Transformation Algorithm for the given Game Shot. 




Fig. 4. System overview for classification of Players’ Basic Actions 



Subsequently, based on the Transformed MVs, the method proposed in our
previous work [3], which was validated in sports game event recognition, is
employed to classify the Player’s Basic Actions. As described in [3], energy
redistribution measurement and weight matrix convolution are applied to the motion
vector fields. First, the energy redistribution function converts a motion vector
field to an energy matrix; then a weight matrix, acting as a Motion Filter, is
designed to detect the response of particular motion patterns. With
the Motion Filter, the temporal sequence of multi-dimensional motion vector fields
becomes a one-dimensional motion response curve called the Temporal Motion
Curve. More details can be found in [3]. In this paper, horizontal and vertical
motion filters are designed, and two Temporal Motion Curves are generated to
characterize the horizontal and vertical motions within the Game Shot. These curves
are then used as features to classify the Player’s Basic Actions with Hidden
Markov Models, which are described in detail in the next subsection.

To classify the basic actions of the players in the top half court and bottom half
court respectively, the Original MVs and the Transformed MVs are each divided into
two parts by the detected net line. For comparison, the two Temporal Motion
Curves are calculated based on the Original MVs and on the Transformed MVs.
Fig. 5 shows an example of the two Temporal Motion Curves of a
Net Game in the top half court. The X axis denotes the frame number and the
Y axis denotes the calculated motion response value. A positive value on the
vertical motion curve means movement toward the net line, and a positive value on
the horizontal motion curve means movement toward the right sideline. From frame
1 to 80, the player runs from the left end of the baseline toward the right end of
the net to take the net; then, from frame 81 to 134, the player walks back from the
net to the right end of the baseline. As shown in Fig. 5 (a), both motion curves are
quite noisy, and the vertical motion curve is too irregular to characterize the net
approach movement. In Fig. 5 (b), the responses of the horizontal and vertical
motion filters are both enlarged, and the segment representing the net approach is
more evident.



[Figure: two panels, (a) Based on Original MVs and (b) Based on Transformed MVs, each plotting the Response of Horizontal Motion Filter and the Response of Vertical Motion Filter against frame number.]

Fig. 5. Comparison between Temporal Motion Curves built on (a) the original MVs
and (b) the transformed MVs respectively



3.2 HMM-Based Action Classification 

For the player in a given half court, two HMMs are built, one for the net-game
segment and one for the baseline-game segment within a Game Shot, as shown
in Fig. 6. The baseline-game segment is modeled with a one-state HMM, with
a continuous mixture of Gaussians modeling the observations. The number of
Gaussian mixture components is set to three, since the movements in a baseline
game are mainly composed of moving left, moving right and keeping still. The HMM
for the net-game segment is a two-state left-to-right model, with one state for
approaching the net and the other for returning to the baseline. Each state is also
modeled with a three-component continuous mixture of Gaussians. The observation
vector for both HMMs has four dimensions, i.e. the response values of the horizontal
and vertical motion filters and their gradient values.
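
The four-dimensional observation can be assembled per frame as sketched below. The function names are hypothetical, and the gradient is taken here as a simple backward difference with zero at the first frame, which is an assumption since the gradient operator is not specified in the text.

```python
def gradient(curve):
    """Backward difference of a temporal curve (0.0 for the first frame)."""
    if not curve:
        return []
    return [0.0] + [b - a for a, b in zip(curve, curve[1:])]

def observation_vectors(horizontal, vertical):
    """Build one [h, v, dh, dv] observation per frame from the two
    Temporal Motion Curves of a half court."""
    if len(horizontal) != len(vertical):
        raise ValueError("curves must cover the same frames")
    dh, dv = gradient(horizontal), gradient(vertical)
    return [list(obs) for obs in zip(horizontal, vertical, dh, dv)]
```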

As illustrated in Fig. 6, the two HMMs are then circularly connected into
a higher-level model with an entering and an exiting state, which represents
the transition between net-game segments and baseline-game segments within
a Game Shot. The transition probabilities from the entering state to the two
sub-HMMs are both set to 0.5. The HMM parameters are trained using
the EM algorithm. Training data are manually chopped into homogeneous
net-game/baseline-game chunks; EM for the net-game model is conducted over
all complete net-game chunks, and likewise for the baseline-game model. In
the recognition phase, the Viterbi algorithm [9] is applied to find the globally
optimal state path given an observation sequence.
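
For reference, Viterbi decoding can be sketched in the log domain as below. This is a generic illustration rather than the system's decoder: the function signature is hypothetical, and the per-frame emission log-likelihoods (which in this paper would come from the Gaussian mixtures) are assumed to be precomputed and passed in as a table.

```python
def viterbi(log_init, log_trans, log_emit):
    """Most likely state path for an HMM.

    log_init[s]     : log initial probability of state s
    log_trans[r][s] : log transition probability from r to s
    log_emit[t][s]  : log-likelihood of observation t under state s
    Returns (path, log probability of the path).
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []  # back[t-1][s] = best predecessor of state s at time t
    for t in range(1, len(log_emit)):
        prev, score, ptr = score, [], []
        for s in range(n_states):
            best = max(range(n_states),
                       key=lambda r: prev[r] + log_trans[r][s])
            ptr.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        back.append(ptr)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(back):  # trace predecessors back to frame 0
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]
```

Working with log probabilities avoids numerical underflow on long Game Shots, where the product of many per-frame probabilities would otherwise vanish.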




Fig. 6. HMMs for classification of Player’s Basic Actions 



4 Experiments 

Four hours of recorded live tennis video are used in the current experiments to
validate the performance of the proposed algorithm. The experimental video data
are collected from the matches of A. Agassi and P. Sampras at US Open 2002
(Video1), and R. Federer and M. Philippoussis at Wimbledon 2003 (Video2).
As ground truth, for each of the two players, a Game Shot containing a net-game
segment is labeled as a Net game Shot (NS); otherwise it is labeled as a Baseline
game Shot (BS). Detailed information on the selected video data is
listed in Table 1.



Table 1. Information of the experimental video data

Video    #Shot  #Game Shot  Top Half        Bottom Half
                            #NS    #BS      #NS    #BS
Video1   881    316         58     258      230    86
Video2   624    271         100    171      114    157
Total    1505   587         158    429      344    243

Half of the experimental data are selected randomly as the training set,
and each Game Shot in the training set is further divided and labeled with
net-game segments and baseline-game segments, for the top half and bottom half
courts respectively. Subsequently, for each half court, the temporal curves of the
vertical and horizontal motion filters are segmented based on the labels and used
to train the two HMMs. In recognition, every Game Shot in which a net-game
segment is detected is considered an NS; otherwise it is considered a BS.
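
The shot-level decision rule and the precision and recall measures reported in the tables can be written down as follows. This is only a sketch: the function names and the string labels "net" and "baseline" for decoded segments are hypothetical.

```python
def label_shot(segment_labels):
    """A Game Shot is NS if any decoded segment is a net-game segment."""
    return "NS" if "net" in segment_labels else "BS"

def precision_recall(predicted, truth, positive):
    """Precision and recall of one class over paired shot labels."""
    tp = sum(p == positive and t == positive
             for p, t in zip(predicted, truth))
    n_pred = sum(p == positive for p in predicted)
    n_true = sum(t == positive for t in truth)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    return precision, recall
```

Because a single detected net-game segment turns the whole shot into an NS, one spurious segment is enough to produce a false NS, which is consistent with noisy vertical responses hurting precision.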

For comparison, the classification is performed based on the original MVs and
the transformed MVs respectively. Table 2 shows the classification results
based on the original MVs. When using the original MVs, the vertical motion
responses of Net Game and Baseline Game cannot be effectively distinguished,
as indicated in Fig. 5 (a). In the top half court, the Net Games have
no salient vertical motion responses, and many of them are incorrectly classified
as Baseline Games. In the bottom half court, the many noisy MVs cause some of
the Baseline Games to have vertical motion responses similar to those of Net
Games, and thus many Baseline Games are misclassified as Net Games.



Table 2. Experimental results based on the original MVs

Game Shot   Top Half Court                Bottom Half Court
            Precision (%)   Recall (%)    Precision (%)   Recall (%)
NS          30.91           37.78         64.55           67.78
BS          70.53           63.81         47.75           44.17


Table 3 shows the classification results based on the transformed MVs.
With the transformed MVs, the performance is improved notably. As the
vertical motion responses are much larger than those of the original MVs, the
distinction between Baseline Game and Net Game is more salient, especially
for the top half court as shown in Fig. 5 (b). The precision and recall rates of
Net Game classification in the top half court are both approximately doubled
(from 30.91% to 61.11% and from 37.78% to 73.33% respectively).



Table 3. Experimental results based on the transformed MVs

Game Shot   Top Half Court                Bottom Half Court
            Precision (%)   Recall (%)    Precision (%)   Recall (%)
NS          61.11           73.33         85.06           82.22
BS          87.50           80.00         74.60           78.33


Sometimes the block-based MV estimation algorithm makes unavoidable
errors, in which case the misclassification cannot be corrected
even by the MV transformation. Furthermore, as the players are small in
proportion to the whole image, noisy MVs may greatly disturb the analysis of
the player’s motion. However, the experimental results indicate that in most
conditions of tennis videos, the algorithm can properly revise the deformed MVs
and make MV-based methods feasible for tennis video analysis.

5 Conclusion 

An algorithm of motion vector transformation is proposed in this paper for
tennis video analysis. In this algorithm, the original deformed motion vectors are
revised according to the player’s true motion. Through such a transformation,
it becomes more feasible to employ MV-based methods in tennis video analysis.
Experiments on classification of player’s basic actions show very promising
results. Future work may include: (i) improving the setup of
the transformation, such as the location of the tennis court lines, (ii) reducing
the disturbance of random noise caused by MV estimation, and (iii) exploring more
applications in tennis analysis based on the MV transformation.

Abstract. In this paper we investigate the retrieval of semantic events
that occur in broadcast sports footage. We do so by considering the 
spatio-temporal behaviour of an object in the footage as being the em- 
bodiment of a particular semantic event. Broadcast snooker footage is 
used as an example of the sports footage for the purpose of this research. 
The system parses the sports video using the geometry of the content 
in view and classifies the footage as a particular view type. A colour 
based particle filter is then employed to robustly track the snooker balls, 
in the appropriate view, to evoke the semantics of the event. Over the 
duration of a player shot, the position of the white ball on the snooker 
table is used to model the high level semantic structure occurring in the 
footage. Upon collision of the white ball with another coloured ball, a 
separate track is instantiated allowing for the detection of pots and fouls, 
providing additional clues to the event in progress. 



1 Introduction 

Research interests in high-level content based analysis, retrieval and summari- 
sation of video have grown in recent years [1]. A good deal of the interest has 
been focused on the detection of semantic events that occur in sports video 
footage [2,3]. This has been fueled primarily by the commercial value of certain 
sports and by the demands of broadcasters for a means of speeding up, sim- 
plifying and reducing the costs of the annotation processes. Current techniques 
used for annotating sports video typically involve loggers manually accounting 
for the events taking place [1]. The existing manually derived metadata can be 
augmented by way of automatically derived low level content-based features such 
as colour, shape, motion and texture [4]. This enables queries against visual con- 
tent as well as textual searches against the predefined annotations allowing for 
more subjective queries to be posed. 

* Work sponsored by Enterprise Ireland Project MUSE-DTV (Machine 
Understanding of Sports Events for Digital Television), CASMS (Content 
Aware Sports Media Streaming) and EU-funded project MOUMIR (MOdels for 
Unified Multimedia Information Retrieval). 



P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 88-97, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

As humans operate at high levels of abstraction and since the most natural
means for the lay person to query a corpus of data is through the use of semantics, 
it makes sense to develop algorithms that understand the nature of the data in 
this way. In order to do so, it becomes necessary to restrict the algorithms to a 
unique domain. These constraints enable low-level content based features to be 
mapped to high-level semantics through the application of certain domain rules. 

The necessity for automatic summary generation methods for sports is high- 
lighted by the fact that the semantic value of the footage spans short durations at 
irregular intervals. The remainder of the footage is generally of no consequence 
to the archetypal viewer (i.e. views of the crowd, breaks in play). Interesting 
events occur intermittently, so it makes sense to parse the footage at an event 
level. This offers the prospect of creating meaningful summaries while eliminat- 
ing superfluous activities. 

A common approach to inferring semantic events in sports footage is to
model the temporal interleaving of camera views [5]. This is
typically carried out using probabilistic modeling techniques such as HMMs or
NNs. This inherent temporal structure of some broadcast sports is not, however,
evident in snooker footage. Thus, a model based on evolving camera views
cannot be used for the purposes of this research. Other works use deterministic
methods [6], but are limited in some regards with respect to the adaptivity of the
models to changes in playing conditions. In this paper, we propose a novel
approach for the detection of semantic events in sports whereby the spatio-temporal
behaviour of an object is considered to be the embodiment of a semantic event.
For the case of snooker, in the appropriate camera view, the white ball is tracked 
using a colour based particle filter [7]. Parzen windows are used to estimate the 
colour distribution of the ball as it is a small object relative to the rest of the 
image. The implementation of the particle filter allows for ball collision detec- 
tion and ball pot detection. A separate ball track is instantiated upon detection 
of a collision and the state of the new ball can be monitored. Detection of such
events augments the HMM decision process by providing a binary classifier where
uncertainty is present. The evolution of the white ball position is modeled using
a discrete HMM. Models are trained using six subjective human perceptions of 
the events in terms of their perception of the evolving position of the white ball. 
The footage is parsed and the important events are automatically retrieved. 

2 Shot Classification 

Similar to other sports, the finite number of fixed camera views used in broadcast 
sports footage are arranged in such a way as to cause the viewer to become 
immersed in the game while trying to convey the excitement of the match to 
a mass audience. In snooker, the typical views used are those of the full-table, 
close-ups of the player or crowd, close-ups of the table and assorted views of the 
table from different angles. 

For the purpose of this research we consider the most important view to be 
that of the full table. Analysis on 30 minutes of televised footage from three 
different broadcast sources shows it to occupy approximately 60% of the total
coverage duration. In this view all balls and pockets on the table are visible, 
enabling ball tracking and pot detection. It is therefore necessary to ensure that 
the camera views can be classified with high precision. 

Shot classification is accomplished using the method outlined in [8]. The 
footage is parsed at a clip level based on the geometrical content of the camera 
views. This approach does not require extraction of 3D scene geometry and is 
generic to broadcast sports footage which exhibit strong geometrical properties 
in terms of their playing areas. The temporal evolution of the derived feature 
is modeled using a first-order discrete HMM, allowing the views to be correctly 
classified. The system for parsing snooker footage is illustrated in figure 1. The 
relevant full table shots are passed to an event processor where tracking, pot 
detection and foul detection are performed. 



[Figure: block diagram with the labels Raw video sequence, Event Recognition, Human Training and Event Extraction.]

Fig. 1. System for parsing broadcast snooker footage.

3 Event Classification

It was observed that the track drawn out by the white ball over the duration of a 
player’s shot characterises an important event. If the spatio-temporal evolution 
of the white ball’s position can be modeled, semantic episodes in the footage can 
be classified. We must firstly define the events of interest that occur in snooker 
in terms of the spatio-temporal evolution of the position of the white ball and 
how pots and fouls affect the event semantics. 



3.1 Events of Interest in Snooker and Game Heuristics 

In snooker, players compete to accumulate the highest score possible by hit- 
ting the white ball and potting the coloured balls in a particular sequence. The 
coloured balls vary in value from one (red) to seven (black), so different strate- 
gies must be employed to gain and maintain control of the table. The occurrence 
of a ball pot or foul (the white ball not colliding with a coloured ball at all) 
will affect the viewer’s perception of the event in hand. A priori domain knowledge
makes use of these events, allowing a set of heuristics to be established which are
used to evaluate the current maximum likelihood classification upon detection 
of a foul or a pot. This is illustrated in figure 3. 

The ‘plays’ we consider are characterised by the spatio-temporal behaviour 
of the white ball as follows (where C is the event number) and are affected by 
the state of the coloured ball (pot/no pot) and the white ball (foul/no foul). 

Break-building: C = 1. As the player attempts to increase his score he will
try to keep the white ball in the center of the table with easy access to
the reds and high valued balls. If a pot has been detected, the player is
attempting to build a high break (C = 1) (figure 2). In the unlikely event
of one of the balls not being potted, the white ball will probably be in a
position such that the remaining balls will be eminently ‘potable’. This is
called an ‘open table’ event (C = 5).

Conservative play: C = 2. Similar to the shot-to-nothing, except a coloured
ball will not be potted when the white navigates the full length of the table.
If this model is chosen as being the most likely, and a pot is detected, a
shot-to-nothing will be inferred (C = 4). This is because the ball will be in
an area where it might prove difficult for a player to pot the next coloured
ball in the sequence.

Escaping a snooker: C = 3. If the player is snookered (no direct line of
sight to a ball) he will attempt to nestle the white amongst the reds or send
the white ball back to the top of the table. If a pot is detected following the
classification of a snooker escape, the heuristics will infer a break-building
event (C = 1). As the only goal of the player will be to escape the snooker
without conceding a foul or an open table, a potted ball simply serves
as a bonus.

Shot-to-nothing: C = 4. The white ball is hit from the top of the table,
traverses the table, and returns back to the top of the table. If a pot is
detected, the pot heuristics will infer a shot-to-nothing (C = 4) (figure 2). If
there is no pot, the spatio-temporal evolution of the white ball position will
show that the player is attempting to return the white ball to the top of the
table. A conservative play event (C = 2) could therefore be inferred, as he
is making the next shot as difficult as possible for his opponent.

In all of these cases a foul by the white, flagged by a non-instantiated second
track, or the white being potted will result in a foul (C = 6) being inferred. Play
will then be transferred to the opposing player.

It was also observed that a snooker escape event is characterised by a cut
from the full-table view to a close-up view of the ball about to be hit. This occurs
while the white ball is still in motion. If the velocity of the white ball satisfies
V > 0 when the cut occurs, a snooker escape is inferred (figure 3).
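
The heuristics of this section can be summarised as a small lookup from the maximum likelihood class and the pot flag to the final event. The sketch below is an illustration only: the function and argument names are hypothetical, and the order in which the close-up cut and foul checks are applied is an assumption, since the decision flow of figure 3 is not reproduced here.

```python
BREAK_BUILDING, CONSERVATIVE, SNOOKER_ESCAPE = 1, 2, 3
SHOT_TO_NOTHING, OPEN_TABLE, FOUL = 4, 5, 6

# (maximum likelihood class, pot detected) -> inferred event C
POT_RULES = {
    (BREAK_BUILDING, True): BREAK_BUILDING,
    (BREAK_BUILDING, False): OPEN_TABLE,
    (CONSERVATIVE, True): SHOT_TO_NOTHING,
    (CONSERVATIVE, False): CONSERVATIVE,
    (SNOOKER_ESCAPE, True): BREAK_BUILDING,
    (SNOOKER_ESCAPE, False): SNOOKER_ESCAPE,
    (SHOT_TO_NOTHING, True): SHOT_TO_NOTHING,
    (SHOT_TO_NOTHING, False): CONSERVATIVE,
}

def infer_event(ml_event, pot, second_track,
                white_potted=False, cut_while_moving=False):
    """Re-evaluate the HMM's maximum likelihood class.

    second_track: True if a second ball track was instantiated,
                  i.e. the white collided with a coloured ball.
    cut_while_moving: True if the view cut to a close-up while V > 0.
    """
    if cut_while_moving:               # assumed to be checked first
        return SNOOKER_ESCAPE
    if white_potted or not second_track:
        return FOUL
    return POT_RULES[(ml_event, pot)]
```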

3.2 Motion Extraction 

The proposed approach is similar to those methods used in handwriting 
recognition [9]. The position of the input device in these systems is easily 
obtainable through a stylus/pad interface. In the case of snooker however, 
the exact position of the white ball is not so readily available. Having located 
the full table views in the footage [8], a robust colour based particle filter is 
employed in order to keep track of the position of the white ball in each frame 
and simultaneously track the first ball hit. 

Localisation of the white ball: Events within clips are found by monitoring 
the motion of the white ball. As there is no camera motion in the full table view, 
the white is initially located by finding the brightest moving object on the table 
as it first starts moving. The semantic episode begins when the white ball starts 
moving and ends when it comes to rest. The implementation of the particle filter 
trivialises the accretion of these velocity values. 

3.3 Ball Tracking 

The tracker used in this work is similar to that implemented in [7]. The objects 
to be tracked however are significantly smaller (approximately 100 pels in size). 
We use the HSV colour space for our colour based probabilistic tracker. In order 
to facilitate an increase in resolution by selecting a small object relative to the 
size of the image, the colour distribution needs to be extended for both target 
and candidate models. Parzen windows are used to estimate the distribution of 
the hue and saturation components while the luminance component is quantised 
to 16 bins to reduce the effect of the lighting gradient on the table. 

Fig. 2. Tracking and table sections. Left to right: Shot-to-nothing; Break building;
Spatial segmentation of the table.

A target model of the ball’s colour distribution is created in the first
frame of the clip. Advancing one frame, a set of particles is diffused around
the projected ball location using a deterministic second order auto-regressive
model and a stochastic Gaussian component. Histograms of regions the same
size as the ball are computed using the particle positions as their centers. A
Bhattacharyya distance measure is used to calculate the similarity between
the candidates and the target, which is in turn used to weight the sample set
$X_k = \{(x_k^{(n)}, w_k^{(n)}) \mid n = 1, \ldots, N\}$, where N is the number of particles used. The
likelihood of each particle is computed as:

$$w_k^{(n)} = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{1 - \sum_{u=1}^{m} \sqrt{p_u(x_k^{(n)})\, q_u}}{2\sigma^2}\right) \qquad (1)$$

where $p(x_k^{(n)})$ is the histogram of the candidate region at position $x_k$ for sample n, $q$
is the target histogram, m is the number of histogram bins and $\sigma^2 = 0.1$.
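As a minimal sketch of the weighting step in equation 1, the following assumes the usual colour-based particle filter form, in which the squared Bhattacharyya distance 1 - Σ√(p_u q_u) sits in a Gaussian exponent. The function names and the list-based histogram representation are illustrative assumptions, not taken from the original implementation.

```python
import math

def bhattacharyya_coefficient(p, q):
    """Similarity of two normalised histograms (1.0 means identical)."""
    return sum(math.sqrt(pu * qu) for pu, qu in zip(p, q))

def particle_weight(p, q, sigma2=0.1):
    """Gaussian likelihood of candidate histogram p given target histogram q."""
    # Squared Bhattacharyya distance; clamp tiny negative rounding errors.
    d2 = max(1.0 - bhattacharyya_coefficient(p, q), 0.0)
    return math.exp(-d2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def normalise(weights):
    """Scale the particle weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]
```

A candidate whose histogram matches the target receives the largest weight, so the normalised weights concentrate on particles that sit on the ball.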



3.4 Collision Detection 

A ball collision is detected by identifying changes in the ratio between the
current white ball velocity $v_k$ and the average previous velocity $v_p$ (defined below,
where d is the frame where the white starts its motion).

$$v_p = \frac{1}{(k-2)-d} \sum_{i=d+1}^{k-2} v_i \qquad (2)$$



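If the ball is in the vicinity of the cushion, a cushion bounce is inferred and d is
set to the current frame. Ratios in the x and y velocity components, $v_k^x/v_p^x$ and $v_k^y/v_p^y$,
are analysed to isolate changes in different directions. A collision is inferred when
the condition in equation 3 is satisfied.

$$h_k = \left\{\left(\left|\tfrac{v_k^x}{v_p^x}\right| < 0.5\right) \wedge \left(\left|\tfrac{v_k^y}{v_p^y}\right| > 0.5\right)\right\} \vee \left\{\left(\left|\tfrac{v_k^y}{v_p^y}\right| < 0.5\right) \wedge \left(\left|\tfrac{v_k^x}{v_p^x}\right| > 0.5\right)\right\} \vee \left\{\left(\left|\tfrac{v_k^x}{v_p^x}\right| < 0.5\right) \wedge \left(\left|\tfrac{v_k^y}{v_p^y}\right| < 0.5\right)\right\} \qquad (3)$$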



The condition therefore flags an event when velocity changes by 50%. The form 
of the decision arises because the physics of colliding bodies implies that at 
collision, changes in velocity in one direction are typically larger than another 
except in the case of a ‘flush’ collision where a reduction of < 50% in both 
directions is exhibited. 
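The decision can be sketched as below, under the assumption that the condition compares the absolute ratio of current to average previous velocity in each direction against 0.5. The function name, the tuple representation of velocities and the handling of a zero previous velocity are illustrative assumptions.

```python
def collision_flag(v_k, v_p, threshold=0.5):
    """Return True when the velocity ratios indicate a collision.

    v_k: (x, y) current velocity of the white ball.
    v_p: (x, y) average previous velocity of the white ball.
    """
    def ratio(cur, prev):
        # Assumption: with no previous motion in a direction, report no change.
        return abs(cur / prev) if prev != 0 else 1.0

    rx = ratio(v_k[0], v_p[0])
    ry = ratio(v_k[1], v_p[1])
    drop_x_only = rx < threshold and ry > threshold
    drop_y_only = ry < threshold and rx > threshold
    flush = rx < threshold and ry < threshold   # both directions drop
    return drop_x_only or drop_y_only or flush
```

For example, `collision_flag((1.0, 5.0), (4.0, 5.0))` flags a collision because the x component has dropped to a quarter of its previous average while the y component is unchanged.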
Pot detection: Distinguishing between correct tracking and the loss of a ‘lock’
can be accomplished by using a threshold, Li, on the sum of the sample likelihoods.
If the cumulative likelihood at time k, L_k > Li, a correct lock is assumed,
and the ball has been found. If L_k / L_{k-1} < 0.5, the ball has been potted.



3.5 Spatial Encoding of the Table 

The dimensions of the table, the positions of the balls and their values dictate 
the flow of the play to be mostly along the long side of the table (or without loss 
of generality, along the vertical) . The temporal behaviour of the vertical position 
of the white alone could therefore be considered to embody a semantic event. 
Using the fact that diagonals of a trapezoid intersect at its center, the table 
can be divided into 5 sections at the coloured ball’s spot intervals (figure 2). 
Initially, the table is divided by intersecting the main diagonals, retrieving the 
center line. Sub division of the two resulting sections retrieves the pink and 
brown lines, and so on. The starting and end positions of the white ball alone 
do not sufficiently represent a semantic event. The model must be augmented 
by the dynamic behaviour of the white ball. The observation sequence, O, is the 
sequence of evolving table sections. 
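The table coding can be illustrated with a short sketch that finds the centre line from the diagonals of the table trapezoid and subdivides the resulting sections. The corner ordering, the function names and, in particular, the choice of which sub-section is divided once more to reach five sections are assumptions made for illustration only.

```python
from bisect import bisect_right

def diagonal_intersection(tl, tr, br, bl):
    """Intersection of diagonals tl-br and tr-bl of a convex quadrilateral."""
    (x1, y1), (x2, y2) = tl, br
    (x3, y3), (x4, y4) = tr, bl
    den = (x1 - x2) * (y3 - y4) - (y1 - y2) * (x3 - x4)
    t = ((x1 - x3) * (y3 - y4) - (y1 - y3) * (x3 - x4)) / den
    return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))

def x_on_edge(a, b, y):
    """x coordinate of the point at height y on the edge from a to b."""
    t = (y - a[1]) / (b[1] - a[1])
    return a[0] + t * (b[0] - a[0])

def split(quad):
    """Split a trapezoid at the horizontal line through its diagonal centre."""
    tl, tr, br, bl = quad
    y = diagonal_intersection(*quad)[1]
    left = (x_on_edge(tl, bl, y), y)
    right = (x_on_edge(tr, br, y), y)
    return (tl, tr, right, left), (left, right, br, bl)

def section_boundaries(quad):
    """Four dividing lines (image y values) giving five table sections."""
    far, near = split(quad)
    far_far, _ = split(far)  # assumed: the far quarter is divided once more
    lines = [diagonal_intersection(*q)[1] for q in (far_far, far, quad, near)]
    return sorted(lines)

def section_of(y, boundaries):
    """Index (0 to 4) of the table section containing image height y."""
    return bisect_right(boundaries, y)
```

Mapping the tracked vertical position of the white ball through `section_of` in every frame yields a discrete sequence of section indices of the kind used as the observation sequence O.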



3.6 Establishing the Model Topology 

Modeling the temporal behaviour of the white ball in snooker is accomplished 
using a first order HMM. HMMs have been shown to be one of the most efficient 
tools for processing dynamic time-varying patterns and allow a rich variety of 
temporal behaviours to be modeled. The model topology is derived from the 
observations, reflecting the nature of the target patterns. A left-to-right/right- 
to-left topology is chosen to model the motion of the white ball for each event, 
revealing the structure of the events in state form. Each section is represented 
by a state in the HMM where the state self-transitions introduce time invariance 
as the ball may spend more than one time-step in any one section. 

Knowing the number of states (or sections of the table), N = 5, and discrete
codebook entries, M = 5, a model $\lambda_c$ can be defined for each of the competing
events. A succinct definition of a HMM is given by $\lambda_c = (A_c, B_c, \pi_c)$, where
c is the event label. The model parameters are defined as: A, the state transition
probability matrix; B, the observation probability matrix; and $\pi$, a vector of
initial state probabilities.

The Baum-Welch algorithm is used to find the maximum likelihood model
parameters that best fit the training data. As the semantic events are well understood
in terms of the geometrical layout of the table, the models can be trained
using human understanding. Six separate human perceptions of the events listed
in section 3.1 were formed in terms of the temporally evolving table coding sequence
of the white ball. The models used are shown in figure 3 with an example
of a single training sequence. The models are initialised by setting $\pi_n = 1$ where
$n = O_1$.
Fig. 3. Event HMMs with pot and foul classifiers.



Each semantic episode can then be classified by finding the model that yields
the greatest likelihood for the observation sequence, according to equation 4.

$$C = \arg\max_{c} \left[ P(O \mid \lambda_c) \right], \quad C = 4 \text{ events} \qquad (4)$$
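The maximum likelihood decision of equation 4 can be sketched with a scaled forward algorithm for discrete HMMs. This is a generic illustration rather than the authors' code; the function names and the dictionary of models keyed by event label are assumptions.

```python
import math

def log_likelihood(obs, A, B, pi):
    """log P(O | lambda) for a discrete HMM, using the scaled forward pass.

    obs: list of observation symbols (table section indices).
    A[i][j]: transition probability, B[i][o]: emission probability,
    pi[i]: initial state probability.
    """
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                     for j in range(n)]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence impossible under this model
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_p

def classify(obs, models):
    """Return the label whose model gives the highest likelihood."""
    return max(models, key=lambda c: log_likelihood(obs, *models[c]))
```

Here `models` maps each event label to its `(A, B, pi)` triple, so a left-to-right model and a right-to-left model compete for the same section sequence.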



4 Results 

Experiments were conducted on two footage sources (F1, F2) from different
broadcasters, of 17.5 and 23.2 minutes in duration. 21 occurrences of the events
to be classified were recognised in F1, of which 11 were break-building, 6 were
conservative plays, 2 shot-to-nothings, 1 open table, 0 snooker escapes and 1
foul. 30 events occurred in F2, of which there were 16 break-building, 8 conservative
plays, 1 shot-to-nothing, 2 open tables, 2 snooker escapes and 1 foul. The
classification results are assessed by computing the recall (R) and the precision
(P):

$$R = \frac{A}{A + C}, \qquad P = \frac{A}{A + B} \qquad (5)$$

where A is the number of correctly retrieved events, B the number of falsely retrieved
events and C the number of missed events.
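As a worked example of the two measures, the sketch below computes precision and recall from the counts A, B and C. The counts used in the usage note (A = 11, B = 1, C = 0 for break-building in F1) are inferred from the reported figures and the stated misclassification, not quoted from the paper; the function name is illustrative.

```python
def precision_recall(a, b, c):
    """Precision and recall from retrieval counts.

    a: correctly retrieved events, b: falsely retrieved events,
    c: missed events. Returns None for a measure that is undefined
    (for example when an event type never occurs and is never retrieved).
    """
    precision = a / (a + b) if (a + b) else None
    recall = a / (a + c) if (a + c) else None
    return precision, recall
```

With 11 break-building events correctly retrieved, one shot-to-nothing falsely retrieved as break-building and none missed, `precision_recall(11, 1, 0)` gives a precision of 11/12 (91.67%) and a recall of 100%.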



Table 1. Event classification results.

Event type                   F1 (P)    F1 (R)    F2 (P)    F2 (R)
Break-building (C = 1)       91.67%    100%      94.12%    100%
Conservative play (C = 2)    100%      100%      100%      75%
Snooker escape (C = 3)       N/A       N/A       100%      100%
Shot-to-nothing (C = 4)      100%      50%       100%      100%
Open Table (C = 5)           100%      100%      66%       100%
Foul (C = 6)                 100%      100%      50%       100%


In F1 the only misclassification was that of a shot-to-nothing being classified
as a break-building event. In F2 a problem arose in the classification of two
conservative plays. One was misclassified as a foul because the white made only
light contact with a coloured ball and the collision was not detected, while
the second was misclassified as an open table event.

5 Discussion 

In this paper we have considered the dynamic behaviour of an object in a sport as 
being the embodiment of semantic episodes in the game. Modeling the temporal 
evolution of the low level feature in this way allows important episodes to be 
automatically extricated from the footage. Results obtained are promising using 
the most relevant 60% of footage. Augmenting the feature set with more tracking 
information could improve the retrieval further. We are currently attempting 
to use the same process to classify semantic episodes that occur in broadcast 
tennis footage. Furthermore, we are investigating the possibility of generating 
game summaries where the excitement of each match could be gauged by the 
frequency of different events. 
Abstract. This paper presents a novel scheme for indexing and segmenting
video by analyzing the audio track using Hidden Markov Models (HMMs). This analysis
is then applied to structuring soccer video. Based on the attributes of soccer
video, we define three audio classes, namely Game-audio, Advertisement-audio
and Studio-audio. For each audio class, an HMM is built using the clip-based
26-coefficient feature stream as the observation symbol. The
Maximum Likelihood method is then applied to classify test data using the
trained models. Because the audio type is highly unlikely to change
too suddenly, we apply smoothing rules in the final segmentation
of an audio sequence. Experimental results indicate that our framework can
produce satisfactory results.



1 Introduction 

Studies addressing sports video structuring have been reported in the literature. Prior
work includes syntactical segmentation [1, 2] and semantic annotation [3, 4], and
most effort has been devoted to the image sequence. However, a video sequence is a rich
multimodal information source, containing audio, text, images, etc. Efficient indexing and
retrieval of video requires taking multiple cues from the video sequence into account. Audio,
as a counterpart of visual information in a video sequence, has recently received more attention
for video content analysis [5, 6]. In addition, the Hidden Markov Model (HMM) [7] has
a good capability to capture the temporal statistical properties of a stochastic process and to
bridge the gap between the low-level features from video data and the high-level
semantics that users are interested in. The emphasis of this paper is on applying the HMM
to automatic audio classification and segmentation for final soccer video structuring.
Clip-based audio features are extracted and used to train the HMMs of the three
kinds of soundtrack in soccer video. Smoothing rules improve the classification and
segmentation accuracy. The results and related discussions are given.

The rest of this paper is organized as follows. In Section 2, we summarize the three
audio classes in soccer video and propose the framework of automatic audio classification
and segmentation for soccer video structuring. In Section 3, we explain the
modules of our system: audio feature extraction, HMM training, test data classification,
smoothing and segmentation. Section 4 presents experimental results and discussion.
Finally, conclusions and future work are given in Section 5.

[Figure: Game, Advertisement, Studio, Advertisement, Game]

Fig. 1. Typical sequence of segments in soccer video



2 The Framework of Audio Classification and Segmentation for 
Structuring Soccer Video 

Figure 1 contains a simplified diagram of a typical sequence of segments in soccer
video. The whole soccer video is composed of five sequential semantic parts: First-half
game, Advertisement, Studio, Advertisement and Second-half game. Usually, the
soundtrack of Game is noisy speech, that of Advertisement is a mixture of
speech and music, and that of Studio is pure speech. These three kinds of
soundtrack, which correspond to different segments, have distinct low-level features.
That is to say, video structuring is achieved once the soundtrack is classified and
segmented. Therefore, we define three types of audio in soccer video for automatic
audio classification and segmentation, namely Game-audio, Advertisement-audio and
Studio-audio.



[Figure: Soccer video; Training phase: extract features from training data, build HMMs (Baum-Welch algorithm); Classify audio; Structuring]

Fig. 2. The framework of automatic audio classification and segmentation in soccer video
based on HMMs

The video indexing problem can thus be converted into an audio classification
problem. Unlike previous approaches, we propose a stochastic model rather
than a fine-tuned one, so that these three kinds of audio can later be extended to many more
audio events. With this in mind, we use the Hidden Markov Model for automatic
audio classification and segmentation in soccer video.

A diagram of our system is shown in Figure 2. Automatic audio classification and
segmentation for soccer video structuring is processed in two steps: the first is the training
phase and the second is the decision phase. In the training phase, clip-based audio
features are extracted from training data and three HMMs are built using the Baum-Welch
algorithm. Then, in the decision phase, the same audio features are extracted from
testing data, and the audio sequence is classified with the Maximum Likelihood method
based on the three HMMs from the first step. Finally, the smoothing rules are used to
improve the segmentation accuracy, which completes the soccer video structuring.
We explain the modules of the two phases in detail in the next section.



3 Audio Classification and Segmentation for Structuring Soccer 
Video Using HMMs 

HMMs have been successfully applied in several large-scale laboratory and commercial
speech recognition systems. In a traditional speech recognition system, a distinct HMM
is trained for each word or phoneme, and the observation vector is computed every
frame (10-30 ms). Here we do not need to capture detailed information at a resolution
of several milliseconds. What we are interested in is the semantic content, which
can only be determined over a longer duration. Based on this observation, the basic
classification unit in our system is not a frame, but a clip.

3.1 Audio Features Extraction [8] 

The audio signal is sampled at 22050 Hz with 16 bits/sample. The audio stream is then
segmented into 1-second clips, each overlapping the previous one by 0.5 seconds.
Each clip is then divided into frames that are 512 samples long. For
each frame, we extract a feature vector with 26 coefficients as follows: the soundtrack
is preprocessed using a 40-channel filter bank constructed from 13 linearly
spaced filters followed by 27 log-spaced filters. Cepstral transformation gives 12 mel
frequency cepstral coefficients (MFCC), 12 MFCC Delta and 2 energy coefficients.
The MFCC Delta is computed as

$$\Delta c_n(m) = \sum_{k=-2}^{2} k\, c_{n-k}(m) \times 0.56, \quad 1 \le m \le P \qquad (1)$$

where c stands for an MFCC coefficient, Δc is the MFCC Delta coefficient and P is
defined as 12.

So for each clip, this gives a 26-coefficient feature stream.
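The clip layout and the delta computation can be sketched as follows. The sketch assumes non-overlapping 512-sample frames and clamps frame indices at the edges of the sequence, neither of which is stated in the text; it also assumes the delta is a weighted sum over two frames on either side with a gain of 0.56. All names are illustrative.

```python
SAMPLE_RATE = 22050            # Hz
CLIP_LEN = SAMPLE_RATE         # 1-second clip, in samples
CLIP_HOP = SAMPLE_RATE // 2    # 0.5-second overlap between clips
FRAME_LEN = 512                # samples per frame

def clip_bounds(num_samples):
    """Start and end sample indices of every full 1-second clip."""
    return [(s, s + CLIP_LEN)
            for s in range(0, num_samples - CLIP_LEN + 1, CLIP_HOP)]

def frames_per_clip():
    """Number of whole frames in a clip (non-overlapping frames assumed)."""
    return CLIP_LEN // FRAME_LEN

def mfcc_delta(cepstra, gain=0.56):
    """Delta coefficients: gain * sum over k in [-2, 2] of k * c[n - k][m].

    cepstra: list of frames, each a list of P cepstral coefficients.
    Frame indices outside the sequence are clamped (an assumption).
    """
    last = len(cepstra) - 1
    deltas = []
    for n in range(len(cepstra)):
        row = []
        for m in range(len(cepstra[n])):
            acc = 0.0
            for k in range(-2, 3):
                idx = min(max(n - k, 0), last)
                acc += k * cepstra[idx][m]
            row.append(gain * acc)
        deltas.append(row)
    return deltas
```

Under these assumptions a two-second signal yields three overlapping clips, and each clip holds 43 whole frames.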
3.2 Building HMMs for Three Audio Classes



We borrow Hidden Markov Models (HMMs) from the speech recognition field, where
they have been applied with great success [7,8], for automatic audio classification
and segmentation.

We model the three audio classes using a set of states with Markovian state transitions
and a Gaussian mixture model for the observation probability density in each state.

We use a continuous density model in which each observation probability distribution
is represented by a mixture density. For state j, the probability $b_j(O_t)$ of generating
observation $O_t$ is given as follows:

$$b_j(O_t) = \sum_{m=1}^{M_j} c_{jm}\, G(\mu_{jm}, \Sigma_{jm}, O_t) \qquad (2)$$

where $G(\mu, \Sigma, O)$ is the multivariate Gaussian function with mean $\mu$ and covariance
matrix $\Sigma$, $M_j$ is the number of mixture components in state j and $c_{jm}$ is the
weight of the mth component. In our system the observation symbol is the 26-coefficient
vector mentioned earlier, and $M_j$ is defined as 15.

For each audio class, an ergodic Hidden Markov Model with 3 states is used to
model the temporal characteristics. With $q_t$ denoting the state at instant t and $q_{t+1}$ the
state at t+1, the elements of matrix A are given as follows:

$$a_{ij} = P(q_{t+1} = j \mid q_t = i) \qquad (3)$$

The parameters of the model to be learnt are the state transition probability distribution
A, the observation symbol probability distribution B and the initial state distribution
$\pi$. The model is simply referred to as $\lambda = (A, B, \pi)$. The Baum-Welch
re-estimation procedure is used to train the model and learn the parameters $\lambda$.
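The mixture density for one state can be sketched as below. For brevity the sketch assumes diagonal covariance matrices, which the text does not specify, and the function names are illustrative.

```python
import math

def gaussian_diag(o, mean, var):
    """Multivariate Gaussian density with a diagonal covariance (assumed)."""
    log_p = 0.0
    for x, mu, v in zip(o, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - mu) ** 2 / v)
    return math.exp(log_p)

def observation_density(o, weights, means, variances):
    """b_j(o): weighted sum of the Gaussian components of one state."""
    return sum(c * gaussian_diag(o, mu, var)
               for c, mu, var in zip(weights, means, variances))
```

In the setting described above, `o` would be a 26-coefficient vector and each state would hold 15 components whose weights sum to one.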

In theory, the re-estimation procedure should give values of the HMM parameters
which correspond to a local maximum of the likelihood function. Therefore a key
question is how to choose initial estimates of the HMM parameters so that the local
maximum is the global maximum of the likelihood function. Here, initial estimates
are obtained by the segmental k-means procedure.

To solve the problem of underflow in training, we perform the computation by incorporating
a scaling procedure. In order to have sufficient data to make reliable
estimates of all model parameters, we use multiple observation sequences. The
modification of the re-estimation procedure is straightforward and goes as follows.
We denote the set of K observation sequences as $O = [O^{(1)}, O^{(2)}, \ldots, O^{(K)}]$, where
$O^{(k)}$ is the kth observation sequence. Then:

$$\bar{\pi}_i = \frac{1}{K} \sum_{k=1}^{K} \gamma_1^{(k)}(i) \qquad (4)$$

$$\bar{a}_{ij} = \frac{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \xi_t^{(k)}(i, j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k - 1} \gamma_t^{(k)}(i)} \qquad (5)$$

$$\bar{b}_j(l) = \frac{\sum_{k=1}^{K} \sum_{t=1,\ \text{s.t.}\ O_t^{(k)} = v_l}^{T_k} \gamma_t^{(k)}(j)}{\sum_{k=1}^{K} \sum_{t=1}^{T_k} \gamma_t^{(k)}(j)} \qquad (6)$$

where $\gamma_t(i)$ denotes the probability of being in state i at time t, and $\xi_t(i, j)$ denotes
the probability of being in state i at time t and in state j at time t+1.
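A minimal sketch of pooling several observation sequences when re-estimating the initial and transition probabilities is given below. It assumes the state posteriors for each sequence have already been computed by a forward-backward pass; the nested-list layout, the function name and the uniform fallback for unvisited states are illustrative assumptions.

```python
def reestimate(gammas, xis):
    """Pool K sequences to re-estimate pi and A.

    gammas[k][t][i]: probability of state i at time t in sequence k.
    xis[k][t][i][j]: probability of state i at t and state j at t+1
                     (one entry fewer than gammas[k] along t).
    """
    K = len(gammas)
    n = len(gammas[0][0])
    # Initial probabilities: average first-frame posterior over sequences.
    pi = [sum(g[0][i] for g in gammas) / K for i in range(n)]
    A = []
    for i in range(n):
        den = sum(g[t][i] for g in gammas for t in range(len(g) - 1))
        if den == 0.0:
            A.append([1.0 / n] * n)  # assumed fallback for unvisited states
            continue
        A.append([sum(x[t][i][j] for x in xis for t in range(len(x))) / den
                  for j in range(n)])
    return pi, A
```

Because numerators and denominators are summed over all sequences before dividing, longer sequences contribute proportionally more evidence to each transition estimate.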



3.3 Audio Classification and Segmentation Using HMMs 

Once the parameters are learnt with the training data, the models can then be used to 
perform maximum likelihood classification for each clip. The classification approach 
leads to segmentation of the audio stream where each segment gets the label of the 
classified model. This label can be used along with the temporal information for in- 
dexing. The likelihood assigned by the classification to each label reflects a degree of 
confidence in the accuracy of the label. This can be used to avoid hard threshold 
while indexing. 

Thus segmentation of an audio stream is achieved by classifying each clip into an
audio class in soccer video. Since the audio stream is
continuous in a video program, the audio type is highly unlikely to change too
suddenly or too frequently. Under this assumption, we apply a smoothing rule in the final
segmentation of an audio sequence [6]. The smoothing rule is:

Rule: if (c[1] ≠ c[0] && c[2] = c[0]) then c[1] = c[0].   (7)

Here three consecutive clips are considered, and c[1], c[0] and c[2] stand for the audio
class of the current clip, the previous one and the next one respectively. The rule implies that if
the middle clip differs from the other two while the other two are the same, the
middle one is considered a misclassification.

After smoothing the classification results, the segmentation accuracy is improved
and the final segmentation is finished.
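The smoothing rule can be sketched as a single left-to-right pass over the clip labels. Applying the rule in one pass, with each decision using the already smoothed previous label, is an assumption; the function name is illustrative.

```python
def smooth(labels):
    """Relabel isolated clips that disagree with both of their neighbours."""
    out = list(labels)
    for i in range(1, len(out) - 1):
        # Middle clip differs while previous and next clips agree.
        if out[i] != out[i - 1] and out[i + 1] == out[i - 1]:
            out[i] = out[i - 1]
    return out
```

For example, `smooth(["G", "A", "G", "G", "S", "S"])` relabels the isolated "A" as "G" and leaves the genuine change from "G" to "S" untouched.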



4 Experimental Results and Discussion 



Three soccer video programs used in our experiment are briefly described in Table 1. 
Table 1. Soccer video programs used in our experiment

No.      Soccer video name                                                  Length   Source
Soccer1  English premier football league: Aston Villa vs. Manchester Utd   24m38s   Sports channel of HNTV on 03/15/2003
Soccer2  English premier football league: Fulham vs. Manchester Utd        25m2s    Sports channel of HNTV on 03/22/2003
Soccer3  German Division One League: Bayern Munich vs. Leverkusen          10m53s   Sports channel of HNTV on 09/20/2003



In our experiments, HMMs are trained on one program and tested on the other programs.
This process is repeated three times [2]. The first performance measure is the
classification accuracy, defined as the number of correctly classified clips over the total
number of clips. Training and testing accuracies are shown in Table 2. The average classification
accuracy (avg-cla) of each program used as testing data is computed as the mean
of the elements of its row; similarly, the average generalization accuracy (avg-gen) is
computed for each program used as training data; and the overall average classification/generalization
accuracy over the entire dataset is given in the lower right corner.
Table 2 shows that the performance of the algorithm is satisfactory.



Table 2. Classification accuracy 



                              Training Data
Testing Data                  Soccer1   Soccer2   Soccer3   Avg-cla
Soccer1   Game-audio          0.9196    0.8228    0.7998    0.8474
          Ads-audio           0.9310    0.8598    0.8964    0.8957
          Studio-audio        0.8646    0.8917    0.88      0.8788
Soccer2   Game-audio          0.8420    0.9891    0.8758    0.9023
          Ads-audio           0.9400    0.9626    0.8598    0.9208
          Studio-audio        0.8486    0.7810    0.8333    0.8210
Soccer3   Game-audio          0.8793    0.8470    0.9347    0.8870
          Ads-audio           0.8816    0.9075    0.9208    0.9033
          Studio-audio        0.8114    0.8767    0.9066    0.8649
Avg-gen                       0.8798    0.8820    0.8756    0.88


Since our goal is to do joint classification and segmentation in one pass, we are
also interested in measuring the segmentation results. The final classification and
segmentation sequence with Soccer2 as testing data and Soccer1 as training data is
shown in Figure 3. For Soccer2, the program begins with Game, followed by the
commercial breaks at half time. The video then continues with the Studio scene. It is
noticeable that between the 1500th clip and the 2000th clip the Studio audio is frequently
interleaved with Game audio. This outcome accords with the fact that, in the Studio
scene of Soccer2, when the anchorman comments on the first-half game, the program
is usually switched to the corresponding game scene. The following sequence
is another period of advertisement and the second-half game.
The step points are taken as the segmentation points.



[Figure: audio type (3 = Game audio, 2 = Ads audio, 1 = Studio audio) plotted against Soccer2 clips (0 to 4500), with segmentation points marked]

Fig. 3. The final classification and segmentation sequence with Soccer2 as testing data and
Soccer1 as training data



We define the segmentation-point-offset as the absolute clip difference between
each segmentation point in the ground truth and the nearest segmentation point in the
detection result. The distribution of the segmentation-point-offset over all testing conditions
is used to measure the segmentation accuracy. The result shown in Table 3 indicates
that more than 70% of the segmentation points are detected within a 3-clip
window.
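The offset measure can be sketched as follows, assuming that segmentation points are the clip indices at which the class label changes and that at least one point is detected. The function names and the example values in the usage note are illustrative, not data from the experiments.

```python
def segmentation_points(labels):
    """Clip indices at which the audio class changes."""
    return [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]

def point_offsets(detected, ground_truth):
    """For every true point, distance in clips to the nearest detected point.

    Assumes `detected` is not empty.
    """
    return [min(abs(g - d) for d in detected) for g in ground_truth]

def fraction_within(offsets, limit):
    """Share of ground-truth points whose offset is below `limit` clips."""
    return sum(1 for o in offsets if o < limit) / len(offsets)
```

With hypothetical detected points at clips 10, 52 and 200 and true points at 12, 50 and 120, the offsets are 2, 2 and 68, so two thirds of the true points fall within a 5-clip window.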



Table 3. Segmentation-point-offset distribution

Segmentation-point-offset   [0,5)   [6,10)   [11,15)   [16,20)   >=20
Percentage                  73%     5%       8%        6%        8%



5 Conclusions and Future Works 

In this paper, we have described a novel soccer video structuring approach using
audio features. We develop a Hidden Markov Model based on the characteristics of
audio in soccer video. The classes supported are Game-audio, Advertisement-audio
and Studio-audio. Making use of these three audio classes, we index and segment the
soccer video. The preliminary experiments indicate that the proposed technique is feasible
and promising. Our framework is generic enough to be applicable to other sports
video, such as tennis and volleyball, and even to other video types. It can also be applied
to the detection of other audio events, because our system involves no class-specific
algorithms or threshold tuning.

In future work, we will enhance the performance of our approach in two ways. Since
the HMM is an efficient model for mapping low-level observations to high-level
semantics, the first direction for future research is to devise better features for better
performance. On the one hand, we should go deeper into audio processing; on the other
hand, combination with visual information should be considered. The second direction
is the improvement of the Hidden Markov Model. Unfortunately, there is no simple,
theoretically correct way of choosing the type of model (ergodic, left-right or some
other form), the model size (number of states), etc. We can therefore choose different
numbers of states and different numbers of mixture components for the HMM to
improve the accuracy. In this direction, the focus is on automatically deciding these
parameters using some optimality criteria.
Abstract. A novel and expandable video summarization model called EDU is
proposed. The model starts from video entities, derives utilities from their descriptions,
and finally generates a video summarization according to the utilities. As a general
expandable model, EDU has a description scheme to save the preprocessed
information, and a utility function based on the descriptions. The concepts and
structure of the EDU model are described in detail, and a method of news story
summarization based on this model is also proposed. Experiments demonstrate
the effectiveness of the method.



1 Introduction 

With the rapid development of multimedia techniques, abundant digital
videos have emerged, such as news, advertisements, surveillance videos, home videos, etc. These
changes call for new techniques for the storage, indexing and accessing of videos.
Among them, one important problem is how to browse large quantities of video data,
and how to access and represent the content of videos. The technique of video summarization
can solve these problems to some extent.

Video summarization is a short summary of the content of a long video document.
It is a sequence of still or moving images representing the content of a video in such a
way that the target party is rapidly provided with concise information about the content
while the essential message of the original is well preserved [1].

Research on video summarization can be traced back to the Informedia
project [2] developed by Carnegie Mellon University. Later, it was widely
studied by various universities and organizations, such as Columbia University [3],
AT&T laboratory, Intel Corporation, Mannheim University [5], Microsoft Research
Asia [6], etc., and many advanced algorithms and methods have been proposed [7,8].

There are about six popular types of video summarization, namely titles, posters,
storyboards, skims and multimedia video summarizations. However, most works lack
a unified model to guide the generation of video summarization. The purpose
of this paper is to build a general video summarization model and realize news video
summarization based on this model.
2 EDU Model

We propose a video summarization model, EDU, which is the abbreviation of Entity-
Description-Utility. First, we introduce some concepts.



2.1 Related Concepts 

Definition 1. Entity. An entity is something that exists in a video. It can be notional
or physical. From top to bottom, we regard all video files, stories, scenes, shots and
frames as entities. Entities at different levels form the structure of videos, and each
entity has its attributes and predications. For example, a frame is an entity, while each
pixel is an attribute of the frame, and the position of each pixel is the predication. A
high-level entity is formed from low-level entities; for example, the shot entity is formed
from many frame entities.

An entity is, in fact, a subset of a video. Supposing a video segment contains N shots,
the kth shot can be described in this way:

$$Shot_k = \{ f_s \in P(f) \mid start(k) \le f_s \le end(k) \}, \quad k \in [1, N] \qquad (1)$$

where $f_s$ is a frame, P(f) is the set of all frames, and start(k) and end(k) are the start and
end frame numbers of shot k respectively.

Definition 2. Description. Description is the abstract and general explanation of an
entity. Unlike the original information, a description is processed and
extracted information, which can be more understandable. Different levels of entities
have different descriptions. We define the descriptions of video, story,
scene, shot and frame entities as $D_v$, $D_{st}$, $D_{sc}$, $D_{sh}$, $D_f$ respectively. The description of an
entity is formed from several descriptors. For example, entity $E_k$ can be described as
follows:

$$D_{E_k} = \{d_{k1}, d_{k2}, d_{k3}, \ldots\}$$

where $d_{k1}$, $d_{k2}$, $d_{k3}$, ... are descriptors of the entity. Users can add descriptors to the
entity.

Definition 3. Utility. Utility is the contribution of an entity. In other words, it 
explains how much work the entity does in representing the video content. We use 
descriptions of each entity to evaluate the utility. And by the utility function, we can 
get a series of utilities. Based on these utilities, we can finally generate video 
summarization. 



2.2 The Formal Description of EDU Model 

We describe the EDU model as follows:

$$EDU = \{E, D, U, \varphi\} \qquad (2)$$

where E is the entity set, D is the description set, U is the utility set, and $\varphi$ is the
relationship set of these sets.

Supposing $E_v$, $E_{st}$, $E_{sc}$, $E_{sh}$, $E_f$ are the sets of video, story, scene,
shot and frame entities respectively, with the relationships
$E_f \subseteq E_{sh} \subseteq E_{sc} \subseteq E_{st} \subseteq E_v \subseteq E$, the EDU model can be described as follows:

$$U = \varphi(E_\alpha) = \varphi_3 \cdot \varphi_2 \cdot \varphi_1(E_\alpha) \qquad (3)$$

where $E_\alpha \subseteq E$, $\alpha \in \{f, sh, sc, st, v\}$. $\varphi_1$, $\varphi_2$, $\varphi_3$ denote three types of operations, namely
entity-to-entity, entity-to-description and description-to-utility. The above formulation
can be described further in the following way:

$$E_\beta = \varphi_1(E_\alpha), \text{ where } E_\beta = \{e_1, e_2, \ldots, e_n\} \qquad (4)$$

The equation reflects the process from entity to entity; we can think of $\varphi_1$ as the
operation of video segmentation or clustering. The equation shows that after operation
$\varphi_1$ on the entity $E_\alpha$ we get n entities, which can be expressed as $e_\beta$. For example,
supposing $E_\alpha$ is a video entity and $e_\beta$ is a story entity, the above equation
means that after story detection we get n stories. Similarly, supposing $E_\alpha$ is a shot entity,
then after the clustering operation $\varphi_1$ we can get the scene entity $e_\beta$.

Further, the process from entity to description can be described as:

$$D = \varphi_2(E_\beta), \text{ where } D = \{d_{ij} \mid 1 \le i \le n,\ 1 \le j \le m\} \qquad (5)$$

After the operation from entity to description, the description set D is obtained.
As each entity can be described with m descriptors, the descriptor set
has n × m elements, where $d_{ij}$ means the jth description of entity i.

Thirdly, utility is generated from the descriptions, which can be shown as follows:

$$U = \varphi_3(D), \text{ where } U = (u_1, u_2, \ldots, u_n)^T \qquad (6)$$

where $\varphi_3$ is the utility function. Supposing it is a simple weighted sum function,
then $u_i = \sum_{j=1}^{m} w_j d_{ij}$, where $\sum_{j=1}^{m} w_j = 1$ and $d_{ij}$ is the normalized utility of the jth
description of entity i.
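The weighted-sum utility function can be illustrated with a short sketch. The selection of the highest-utility entities in temporal order is an assumption about how the utilities might be used, and all function names are illustrative.

```python
def utilities(descriptions, weights):
    """Weighted-sum utility of each entity.

    descriptions[i][j]: normalised utility of descriptor j of entity i.
    weights[j]: descriptor weight; the weights must sum to one.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [sum(w * d for w, d in zip(weights, row)) for row in descriptions]

def select_top(entity_utilities, k):
    """Indices of the k highest-utility entities, kept in temporal order."""
    ranked = sorted(range(len(entity_utilities)),
                    key=lambda i: entity_utilities[i], reverse=True)
    return sorted(ranked[:k])
```

For two entities with descriptor values (1.0, 0.0) and (0.5, 0.5) and weights (0.25, 0.75), the utilities are 0.25 and 0.5, so the second entity would be preferred for the summary.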



2.3 The Structure of EDU Model 

The EDU model reflects the idea of generating a summarization by obtaining
descriptions from entities and then deriving utilities from them (see Fig. 1).

Fig. 1. Structure of EDU model

First, the original video streams are segmented to obtain entities at different levels.
Different entities are chosen for different applications. For example, to summarize
news stories, the shot is a proper basic entity; to summarize a news topic, the news
story is a proper basic entity.

Second, entities of different levels are described automatically or semi-automatically.
For example, a shot entity can have many descriptors. If a face is detected in a shot,
a face descriptor is added to the shot entity, and the occurrence time and position of
the face are saved. Other descriptors can also be added to a shot entity.

Finally, each entity's utility is obtained from its descriptors by the utility function.
This series of utilities is the basic measurement for video summarization.

The EDU model has at least two advantages. First, it has a sharable description system,
which is used to save preprocessed information. Second, it has a utility function based
on the descriptions, which serves as the measurement for video summarization. The first
advantage makes the model easy to extend, while the second reflects users' interests in
the summarization.



3 News Video Entity Description Based on EDU Model 

As mentioned above, videos include five levels of entities, namely video stream,
story, scene, shot and frame. For news videos, the scene entity is neglected here.
Different entities are assigned different descriptions.

Video entity description ($D_v$) is the highest-level description; it describes the
theme and the classification of a video, and answers questions like “what type is this
video?”

News story entity description ($D_{st}$) describes the topic and the main point of a story,
and tells users “what has happened?”

Shot entity description ($D_{sh}$) describes the content of a shot from the physical and
perceptual aspects. The content of a shot description can be low-level features, such as
duration or histogram, or high-level perceptual features, such as characters or scene
types.

Frame entity description ($D_f$) is the lowest-level description; it seldom includes
semantics, and mainly includes physical features such as colors and textures.

All these descriptions form the description system. 



4 News Story Summarization Method Based on EDU Model 

A news story is a basic unit for people to understand the news. Summarizing a news
story gives a quick view of the story while preserving its important content.
Considering the characteristics of news videos, we use the shot as the basic entity for
news story summarization.



4.1 From Shot Description to Shot Utility 

Shot utility relies on shot description. It includes shot type utility, face utility,
caption utility, etc. In this section, we discuss how to calculate utilities from
descriptors, and how to obtain the utility of a shot by the utility function mentioned
in Section 2.



Shot Type Utility. Each news story is formed from several shots. Similar shots can
appear in a news story several times. For example, some scenes of crowds may be edited
into a leader's announcement. Obviously, similar shots are redundant. They should be
identified, and only the representative ones should be picked out.

This can be accomplished by clustering similar shots. We take the key frame's
histogram recorded in the key frame descriptors as a feature of a shot, and use the
k-means clustering method to cluster similar shots.

Supposing we have obtained $N$ clusters of shots in a news story, we define the weight
of the $i$-th cluster to be $W_i$ [7]:

$$W_i = \frac{S_i}{\sum_{j=1}^{N} S_j} \qquad (7)$$

where $S_i$ is the total duration of the $i$-th cluster.

Generally, short and similar shots are not important, while long and seldom-appearing
shots may be more important. This means that as a cluster's weight rises, the importance
of its shots decreases. So we define the importance of shot $j$ of cluster $k$ to be $I_j$
(it is not a utility yet, for it has not been normalized) [7]:

$$I_j = L_j \log \frac{1}{W_k} \qquad (8)$$

where $L_j$ is the duration of shot $j$ and $W_k$ is the weight of cluster $k$.

We conclude from the equation that the importance of a shot decreases as its cluster's
weight rises, and increases as the shot's duration rises. We then assign a utility of
one to the shot with the highest importance in a news story, and a utility of zero to
the shot with the lowest importance. After this normalization, we finally obtain the
shot type utility $d_1$.
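
The computation of the shot type utility can be sketched as follows. The sketch takes the importance as $L_j \log(1/W_k)$, so that it falls as the cluster weight rises, in line with the description above. The function name, the sample durations and the handling of ties are assumptions made for illustration; the cluster labels are assumed to come from a k-means step that is not shown.

```python
import math


def shot_type_utilities(durations, cluster_ids):
    """Shot type utility d_1 per shot, from shot durations and cluster labels."""
    # S_k: total duration of each cluster; W_k = S_k / sum of all S.
    totals = {}
    for length, k in zip(durations, cluster_ids):
        totals[k] = totals.get(k, 0.0) + length
    grand_total = sum(totals.values())
    weights = {k: s / grand_total for k, s in totals.items()}

    # I_j = L_j * log(1 / W_k): long shots from rare clusters score highest.
    importance = [length * math.log(1.0 / weights[k])
                  for length, k in zip(durations, cluster_ids)]

    # Min-max normalization: highest importance -> 1, lowest -> 0.
    lo, hi = min(importance), max(importance)
    if hi == lo:
        return [1.0 for _ in importance]  # assumption: treat a full tie as utility 1
    return [(i - lo) / (hi - lo) for i in importance]


durations = [4.0, 5.0, 12.0, 3.0]   # seconds (made-up values)
cluster_ids = [0, 0, 1, 0]          # e.g. labels produced by k-means
d1 = shot_type_utilities(durations, cluster_ids)
```

In this toy story the single 12-second shot receives utility 1 and the 3-second shot receives utility 0.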



Face Utility. The face descriptor of a shot includes the face picture itself, the
occurrence time of the face, the position of the face, etc. We can define the face
utility from these physical features.

Generally, users pay more attention to a face in the center of the screen. We define
the importance of a face as follows [6]:



$$I_{face} = \sum_{k=1}^{N} \frac{A_k}{A_{frame}} \times \frac{W_{pos}}{8} \qquad (9)$$

where $I_{face}$ is the importance of the face in a shot, $N$ is the total number of faces in a
frame, $A_k$ is the area of the $k$-th face in the frame, $A_{frame}$ is the area of the frame, and
$W_{pos}/8$ is the position weight.

We can see from the above equation that $I_{face} \in [0, 1)$, and that it will be far less
than 1 because the area of a face is only a small part of a frame. Similar to the shot
type utility, we assign a utility of one to the face with the highest importance and a
utility of zero to the face with the lowest importance. After this normalization, we
finally obtain the face utility $d_2$.
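
A minimal sketch of the face importance for one frame is given below. It assumes the form suggested by the definitions above: each face contributes its area ratio multiplied by a position weight divided by 8. The concrete position weights (8 for a central face, 2 near the edge), the frame size and the function name are hypothetical, since the text does not list them.

```python
def face_importance(face_areas, position_weights, frame_area):
    """I_face = sum_k (A_k / A_frame) * (W_pos_k / 8) for one frame."""
    if len(face_areas) != len(position_weights):
        raise ValueError("one position weight per face is required")
    return sum((area / frame_area) * (w_pos / 8.0)
               for area, w_pos in zip(face_areas, position_weights))


# Hypothetical frame of 352 x 288 pixels with two detected faces.
frame_area = 352 * 288
areas = [60 * 80, 40 * 50]   # face bounding-box areas in pixels
w_pos = [8, 2]               # assumed weights: central face, face near the edge
i_face = face_importance(areas, w_pos, frame_area)
```

The result is a small fraction of 1, which is why the normalization step is needed before the value is used as a utility.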



Caption Utility. Captions are frequent in news videos. They are titles of news stories
or names of leaders. Captions are added for better understanding of news topics; only
a few captions, such as the names of leaders, can be used as the description of the
current shot.

For title captions, we regard them as the source of the story description, and set the
shot caption utility to zero. For annotation captions, we simply set the utility to 1.
For shots with no caption, the caption utility is obviously zero. We denote the caption
utility as $d_3$.
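
The caption rule amounts to a small lookup, sketched here. The labels "title" and "annotation" and the function name are illustrative names chosen for this sketch; how a caption is classified is outside its scope.

```python
def caption_utility(caption_type):
    """Caption utility d_3 of a shot.

    caption_type: "title", "annotation" or None (no caption). These labels
    are illustrative names, not identifiers taken from the paper.
    """
    if caption_type == "annotation":
        return 1.0   # e.g. a leader's name describes the current shot
    if caption_type in ("title", None):
        return 0.0   # titles feed the story description, not the shot
    raise ValueError(f"unknown caption type: {caption_type!r}")


d3 = [caption_utility(c) for c in ("title", "annotation", None)]
```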



Other Utilities. There are many other descriptors of the shot entity besides type, face
and caption. Users can add the descriptors they need, such as motion. It should be
mentioned that each added descriptor must be assigned its own utility.



4.2 The Summarization Method Based on Changeable Threshold

We give equal weights to the above descriptors and obtain each shot entity's utility
by the utility function mentioned in Section 2. Thus we get a series of utilities, one
for each shot entity. The utility, a number between zero and one, reflects the
importance of an entity in representing the whole news story (see Fig. 2).



[Figure: utility (vertical axis) plotted against time t (horizontal axis) for Entity 1 to
Entity 4, with a horizontal threshold line]

Fig. 2. Summarization method based on changeable threshold



As Fig. 2 shows, each entity has a corresponding utility value. The higher the utility
is, the more important the entity is. The summarization is formed from the important
entities. The threshold in Fig. 2 changes with users' queries. For example, if we want a
summarization with 20% of the length of the original video, we lower the threshold from
1 to the point where the entities above the threshold form 20% of the original video.
Compared with a summarization method based on a fixed threshold, this method is more
representative and makes the length of the summarization easy to control.
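
The changeable threshold can be sketched as a selection loop that lowers the threshold from 1 until the length budget is used up. The function name, the sample values and the stopping rule (stop before the first entity that would exceed the budget) are assumptions for illustration; the text does not specify how overshoot or ties are handled.

```python
def summarize(utilities, durations, ratio):
    """Lower a utility threshold until the target summary length is reached.

    Returns (threshold, indices of the selected entities in original order).
    """
    budget = ratio * sum(durations)
    # Visit entities from the highest utility downwards,
    # which is equivalent to lowering the threshold from 1.
    order = sorted(range(len(utilities)), key=lambda i: utilities[i], reverse=True)
    selected, used, threshold = [], 0.0, 1.0
    for i in order:
        if used + durations[i] > budget:
            break   # the next entity would exceed the target length
        selected.append(i)
        used += durations[i]
        threshold = utilities[i]
    return threshold, sorted(selected)


utilities = [0.9, 0.2, 0.6, 0.4, 0.1]
durations = [10.0, 30.0, 10.0, 20.0, 30.0]   # seconds (made-up values)
threshold, chosen = summarize(utilities, durations, ratio=0.2)
```

With these values the two highest-utility entities fill the 20% budget, and the threshold settles at the utility of the last entity admitted.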



5 Experiment Results 

To evaluate the effectiveness of the EDU model, a group of common descriptors was
chosen for the experiment. The videos were from CCTV Night News, Phoenix TV News and
CCTV World Report. The length of the original video was about 40 minutes. With the same
condensation rate of 20%, we applied three different summarization methods to generate
video skims, and invited fifteen students who had not seen the news before to evaluate
the summarization results.

As Fig. 3 shows, the three methods were: (1) shot sampling and audio sampling;
(2) shot sampling and anchor voice choosing; (3) EDU model based summarization.

The first method chose the first 20% of each shot as the summarization. The second
method was the same as the first, except that it chose the voice of the anchor shots.
The third method was our proposed method based on the EDU model; it also chose the
voice of the anchor shots.

[Figure: selected and discarded parts of the video under each of the three methods]

Fig. 3. Sketch map of three summarization methods



To avoid interference between the different summarization methods, we divided the
fifteen students into three groups and tested the three methods. Each group browsed
only one kind of summarization in a test. For example, for Night News, group A watched
the summarization generated by method one, group B watched the summarization generated
by method two, and group C watched the summarization generated by method three. After
each test, the groups rotated so that each group had the chance to watch the other
types of summarization.

After browsing each kind of summarization, each viewer answered five questions,
concerning the time, place, relevant persons and event of the news, and the fluency of
the summarization. The students then watched the original videos, and a score was given
to each method based on their understanding of the summarization. For example, if the
student had caught the time at which the news story happened, the score was 10;
otherwise, it was zero.

After three tests, we obtained the average score for each question. The results are as
follows:



Table 1. Test results

            Time    Space   Person  Event   Fluency
Method 1    5.4     4.2     6.8     2.4     2.4
Method 2    8.0     9.0     7.2     6.5     5.0
Method 3    8.5     9.6     8.9     7.4     7.5



From Table 1, we can see that the first method was not fluent and was difficult for
users to understand. The second method was better, mainly because of the voice
accompanying the anchor shots, but its fluency was also low. The third method
generated the summarization based on the EDU model, and users could understand the news
better.



6 Conclusions 

In this paper, a unified video summarization model called the EDU model was proposed.
The model starts from video entities, obtains utilities through descriptions, and
finally generates the video summarization from the utilities. As a general, expandable
model, EDU has a description scheme to save preprocessed information, and a utility
function based on descriptions. Users can add descriptors according to their needs, and
different levels of video summarization can be realized. In the test, we chose news
story summarization as an example, and demonstrated the effectiveness of the EDU model.
In the future, more work will be done on the model, and we will try to apply it to
other summarization fields.



Abstract. In this paper, we propose a novel news video mining method based
on statistical analysis and visualization. We divide the process of news video
mining into three steps: preprocessing, news video data mining, and pattern
visualization. In the first step, we concentrate on content-based segmentation,
clustering and event detection to acquire the metadata. In the second step, we
perform news video data mining by statistical methods. Considering the features
of news videos, the analysis mainly concentrates on two factors: time and space.
In the third step, we visualize the mined patterns. We design two visualization
methods: the time-tendency graph and the time-space distribution graph. The
time-tendency graph reflects the tendencies of events, while the time-space
distribution graph reflects the relationships of time and space among various
events. We integrate news video analysis techniques with the data mining
techniques of statistical analysis and visualization to discover implicit,
important information in large amounts of news videos. Our experiments show
that this method is helpful for decision-making to some extent.



1 Introduction 

We come into contact with abundant news videos every day, and can get a lot of
information from them. Most news videos are factual and timely, so they are often
helpful for people making important decisions. For example, almost all investors
have learned to find promising areas and industries by watching television before
deciding on their investments. Some important news events seriously affect people's
daily lives; for example, the SARS epidemic, which happened not long ago, had a
negative effect on various trades around the world, especially tourism, and on people's
attitudes to life. When such events happen, we all wonder how, and for how long, they
will affect us. Will the situation become worse? All this suggests that news videos
should not be consumed once and discarded, but should be stored and analyzed seriously.
By analyzing and mining large numbers of news video programs, we can find valuable
information to help people make important decisions.

News video mining is a rising research field, which belongs to multimedia mining.
Multimedia mining has become a hot research area in recent years; it is an
interdisciplinary subject developed from various areas including multimedia databases,
data mining, information systems, etc., and is defined as the process of discovering
implicit and previously unknown knowledge or interesting patterns from a massive set of
multimedia data. Since the first international conference of MDM/KDD in 2000, more and
more scholars have paid attention to multimedia mining. Many new concepts, methods and
framework theories of multimedia mining have been proposed [1], but most are confined
to spatial data and image data mining. Research on video data mining is still in its
infancy. Generally, there are three types of videos [1]: the produced, the raw, and the
medical video. JungHwan [1] proposes a general framework and some methods for raw video
mining. As to news video mining, much work is still to be done. Wijesekera [2]
discusses the problem of applying traditional data mining methods to cinematic video
mining; Kim [3] incorporates domain knowledge with audio-visual analysis in news video
mining; Kulesh [4] proposes a personalized news video access method through the
exposition of their PERSEUS project.

As mentioned above, there have been some efforts on video data mining in various
fields. In this paper, we aim at discovering the implicit information in news videos.
First, we extract and statistically analyze the news video content, and then propose
two novel visualization methods: the time-tendency graph and the time-space
distribution graph. The time-tendency graph reflects the tendencies of events, while
the time-space distribution graph reflects the spatial-temporal relationships among
various events. These two visualization methods can be useful for decision-makers.

Figure 1 shows the flowchart of news video data mining. In this chart, we divide
the process of news video mining into three steps: preprocessing, video data mining
(for example, statistical analysis), and pattern visualization.




Fig. 1. Flowchart of news video data mining 

Corresponding to the above flowchart, we organize this paper as follows: Section 2
covers the preprocessing of news video data, including preparation work such as shot
boundary detection, clustering and event detection; Section 3 presents the statistical
analysis of news stories; Section 4 proposes two visualization methods, namely the
time-tendency graph and the time-space distribution graph; finally, we give our
concluding remarks in Section 5.



2 Preprocess of News Video Data

As we know, video data is a kind of unstructured stream. In order to be recognized by
computers, these data need to be preprocessed. In this stage, we accomplish syntactic
video segmentation and semantic annotation. In other words, we segment news videos
into meaningful units such as shots, scenes and stories. We also try to obtain the
semantic content of each unit in this stage.

Many content-based segmentation methods have been proposed. We adopt the method of
comparing the color histograms of neighboring frames to decide the shot boundaries [6],
and then use the k-means method to cluster similar shots into scenes. By analyzing the
time relationships between shots in a scene, we can obtain a series of news stories.
These segmented stories will be the metadata for later video mining, so we pay more
attention to this basic unit.
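
As an illustration of the first step, shot boundary detection by comparing the color histograms of neighboring frames can be sketched as below. This is a generic histogram-difference detector, not necessarily the exact procedure of [6]; the normalized L1 distance, the threshold of 0.5, the function names and the toy histograms are all assumptions.

```python
def histogram_difference(h1, h2):
    """Normalized L1 distance between two color histograms, in [0, 1]."""
    total = sum(h1) + sum(h2)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(h1, h2)) / total


def shot_boundaries(histograms, threshold=0.5):
    """Indices i at which frame i starts a new shot."""
    return [i for i in range(1, len(histograms))
            if histogram_difference(histograms[i - 1], histograms[i]) > threshold]


# Toy 4-bin histograms for six frames: the content changes at frame 3.
frames = [[90, 10, 0, 0], [88, 12, 0, 0], [91, 9, 0, 0],
          [5, 5, 60, 30], [4, 6, 62, 28], [5, 5, 61, 29]]
cuts = shot_boundaries(frames)
```

A fixed threshold of this kind detects abrupt cuts; gradual transitions would need a different treatment, which is not covered here.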

Moreover, some semantic events in news videos may be interesting patterns to most
users. Here, we mainly discuss two kinds of semantic events: caption events and face
events. Some semantic events are periodic and repeated; for example, caption events
happen periodically in the whole video stream and sometimes appear repeatedly in a
news story. Others, such as face events in video streams, do not have such obvious
periodicity and repetition.

We adopt the method proposed by Tang [9] to detect caption events in news videos.
For face events, we adopt the object detection method proposed in [5]. We then set
some basic rules to exclude small face events and retain the feature face events.

Feature extraction runs in parallel with the processes of shot boundary detection,
clustering and event detection (see Fig. 1). In this process, we extract features of
frames, shots, scenes, stories, faces and captions, and some other description
information, such as the occurrence time and place of news, which is obtained from
speech recognition. All these features are stored in the database.



3 Statistical Analysis of News Stories 

After the preprocessing of news videos, we obtain a series of news stories, called
metadata. But we still cannot find interesting patterns by arranging them in linear
order only. Most decision-makers pay more attention to important news stories, which
leave a profound impression on them and can greatly affect their final decisions.

3.1 News Stories’ Importance Model 

Generally, important news implies more information, and is always the focus of the
public. Therefore, it is necessary to model news importance. Our news importance
model considers the following aspects:

Sources. It is understandable that news from authoritative TV stations is more
important and authoritative than news from local ones. Based on this, we divide news
sources into five levels, namely world, country, province, city and county level, and
assign different importance values to different levels. The importance of the highest
level is 1, and the importance decreases by 0.2 for each lower level. For example,
news reported by world-level TV stations, such as Reuters in Britain and CNN in
America, is assigned the highest importance value, while news reported by county TV
stations is assigned the lowest one.

Playtime. News played in prime time is more important than news played at other times.
In the same way, we assign different importance values to the 24 hourly periods of a
day, with higher importance values for prime time, such as 7 a.m., midday and 7 to
8 o'clock in the evening. News reported at midnight is usually a replay or
unimportant, and is assigned a lower importance value.

Reported Frequency. If a news item is reported by several TV stations, or reported and
tracked by one TV station several times, then we can believe that the news is
important. Here we introduce the concept of duplicates in news. A duplicate is the
same news reported by different TV stations from different points of view. Sometimes
duplicates are not visually similar to each other and can hardly be recognized by
traditional algorithms. This creates redundancy in the video data and has a bad
influence on search results. To solve this problem, Jaims [7] proposes a framework to
detect duplicates. Based on this framework, we have designed a news duplicate
detection method that integrates image and audio similarity matching, face matching
and voice script matching.

Play Order. In the same news program, news reported earlier is more important, so the
play order of news should be one of the important factors. This resembles the layout
of newspapers, where the most important news is always placed on the front page. We
set the importance of the start time of a news program to 1, and the importance of its
end time to 0. In this way, we build an inverse coordinate axis (see Fig. 2). Each news
story's importance in the program is computed from its start time and is denoted $P_i$
($P_i \in (0, 1]$). For example, the thick line in Fig. 2 represents a news story in a news
program; according to its start time, its relative position is $P = 0.7$, which is also
the news story's play order importance.
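
The play order importance can be sketched as a position on the inverse time axis. The text fixes only the two endpoints (1 at the program start, 0 at the program end); the linear interpolation between them, the function name and the sample times are assumptions of this sketch.

```python
def play_order_importance(story_start, program_start, program_end):
    """Relative position on the inverse axis: 1 at program start, 0 at program end."""
    length = program_end - program_start
    if length <= 0:
        raise ValueError("program_end must be later than program_start")
    if not program_start <= story_start < program_end:
        raise ValueError("story must start inside the program")
    return 1.0 - (story_start - program_start) / length


# A 30-minute program (times in seconds) and a story starting 9 minutes in.
p = play_order_importance(story_start=540, program_start=0, program_end=1800)
```

A story that starts 30% of the way through the program thus receives a play order importance of about 0.7, matching the example given for Fig. 2.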

Duration. Generally, important news stories last longer than unimportant ones. For
example, news about the Iraq War commonly lasts five to ten minutes, while other news
lasts only two minutes or less. To some extent, the duration of a news story suggests
its importance.

Feature Face. After detecting the feature faces as described in Section 2, we use the
detection results to analyze the importance of news stories. For example, news in which
leaders' feature faces appear should be more important than news in which only
civilians' faces appear.

[Figure: inverse time axis running from 1 (program start) to 0 (program end), with a
news story marked at 0.7]

Fig. 2. Sketch map of relative position of a news story

According to all the characteristics mentioned above, we extract and save the necessary
information of news stories, including playtime, duration, play order, etc., and then
propose the following news importance model.

Suppose $I_s$, $I_t$, $I_d$, $I_o$, $I_l$ and $I_f$ are importance measurement units, denoting the
importance of a news story's source, playtime, play times, play order, duration and
feature face respectively, and $w_1$ to $w_6$ are the six corresponding weights. Then a
simple linear combination can represent the importance model of news stories as
follows:

$$I = w_1 I_s + w_2 I_t + w_3 I_d + w_4 I_o + w_5 I_l + w_6 I_f \qquad (1)$$

where $\sum_{k=1}^{6} w_k = 1$, $I_i \in [0, 1]$, $i \in \{s, t, d, o, l, f\}$.
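
The linear importance model, together with the five source levels described earlier, can be sketched as follows. The dictionary keys follow the subscripts s, t, d, o, l and f; the function names, the sample component values and the equal weights are assumptions for illustration, since the text does not report the weights actually used.

```python
SOURCE_LEVELS = ["world", "country", "province", "city", "county"]


def source_importance(level):
    """1.0 for the world level, decreasing by 0.2 per level down to 0.2."""
    return round(1.0 - 0.2 * SOURCE_LEVELS.index(level), 1)


def story_importance(components, weights):
    """I = w_1*I_s + w_2*I_t + w_3*I_d + w_4*I_o + w_5*I_l + w_6*I_f."""
    keys = ("s", "t", "d", "o", "l", "f")
    if abs(sum(weights[k] for k in keys) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    for k in keys:
        if not 0.0 <= components[k] <= 1.0:
            raise ValueError(f"I_{k} must lie in [0, 1]")
    return sum(weights[k] * components[k] for k in keys)


# Made-up component values for one story; equal weights are an assumption.
components = {"s": source_importance("country"), "t": 1.0, "d": 0.5,
              "o": 0.7, "l": 0.4, "f": 1.0}
weights = {k: 1 / 6 for k in components}
importance = story_importance(components, weights)
```

Because every component lies in [0, 1] and the weights sum to 1, the combined importance also lies in [0, 1].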



3.2 Statistical Analysis of Time and Space 

Using the importance model above, we obtain a series of important news stories, called
topics. Since a single news story can neither reflect the tendency and development of a
topic nor reveal the relationships of time and space between different news stories, it
is necessary to adopt statistical analysis to find them.

Among the factors of time, place, person and event of news, we focus on time and
place. By analyzing these two factors, we can understand the relationships between
news stories more accurately.

Note that our statistical analysis is performed on a single topic, for example, the
topic of SARS. First, we calculate the event occurrence frequency along the time axis.
As we know, most topics last for a period of time, which may be a week, a month or even
longer. At different points in time, more or fewer news stories relevant to the topic
may occur, and together they form the developing process of the topic. We hope to find
the implicit tendency among news stories by statistical analysis along the time axis.

Given a topic $I$, suppose there are $C_t$ news stories relevant to the topic on the same
day $t$, and the importance of news story $i$ is $I_i$; then the topic's total importance on
day $t$ is $S(I, t)$:

$$S(I, t) = \sum_{i=1}^{C_t} I_i \qquad (2)$$

Thus, the time importance function of the topic is established. For example, if there
are four pieces of relevant news about SARS on April 19th, then by computing the
importance of each ($I_1$, $I_2$, $I_3$, $I_4$), we conclude that the importance of SARS on
April 19th is $S(\mathrm{SARS}, 4/19) = I_1 + I_2 + I_3 + I_4$. This importance reflects the severity
of SARS on April 19th.
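
The daily topic importance is a grouped sum, sketched below. The function name and the importance values are made up for illustration; the stories are assumed to have been assigned to the topic already.

```python
from collections import defaultdict


def topic_importance_by_day(stories):
    """S(I, t): sum of story importances per day for one topic.

    stories: iterable of (day, importance) pairs for stories of the topic.
    """
    totals = defaultdict(float)
    for day, importance in stories:
        totals[day] += importance
    return dict(totals)


# Made-up importances: four SARS stories on 19 April and one on 20 April.
stories = [("2003-04-19", 0.8), ("2003-04-19", 0.6), ("2003-04-19", 0.7),
           ("2003-04-19", 0.5), ("2003-04-20", 0.9)]
s_by_day = topic_importance_by_day(stories)
```

Plotting these daily sums against the date gives the kind of curve that a time-tendency graph displays.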

In the same way, we can count the locations mentioned in the news and try to find the
regional centrality. In our experiment, we build a place database, which includes many
countries, regions and cities that have appeared in news. We obtain the occurrence
place of each news story by speech recognition and place name extraction.

Similar to the statistics along the time axis, given a topic $I$, we calculate the
topic's importance at place $p$ according to this formula:

$$S(I, p) = \sum_{i=1}^{C_p} I_i \qquad (3)$$

where $S(I, p)$ is the topic's importance at place $p$, $C_p$ is the number of news stories
relevant to the topic at place $p$, and $I_i$ is the importance of the $i$-th news story. For
example, from April 12th to April 24th, there were 14 pieces of news related to SARS in
Beijing. We compute the importance of each ($I_1$, $I_2$, ..., $I_{14}$), and obtain the importance
of SARS in Beijing, $S(\mathrm{SARS}, \mathrm{Beijing}) = I_1 + I_2 + I_3 + \ldots + I_{14}$.



4 Visualization of Mined Patterns 

After the mining process of each stage, we have obtained some interesting patterns.
Next, we visualize these patterns.

CMU uses timelines and maps to visualize news videos [8]; inspired by this, we design
two visualization methods: the time-tendency graph and the time-space distribution
graph. The time-tendency graph reflects the development of topics; the time-space
distribution graph reflects the relationships between time and space, and is helpful
to decision-makers as a whole.

We collected the Night News of CCTV (an authoritative news program in China) from
April 12th to April 24th, 2003 as the experimental data set. Fig. 3 is an example of a
time-tendency graph, which indicates the time tendency of the topic SARS. The
horizontal axis represents time, and the vertical axis represents the importance of the
topic. We can see from Fig. 3 that from April 12th, the news importance of SARS rises,
which indicates that SARS became more and more serious in the world and that the media
paid more attention to it day by day.

In our system, a user who is interested in the news stories of a certain day can
browse the detailed news information by choosing the key frames. Fig. 4 shows a user's
browsing result for news stories relevant to SARS on April 21st.

Fig. 3. Time-tendency graph of the topic SARS

Fig. 4. Browsing the news stories relevant to SARS on April the 21st

Fig. 5 and Fig. 6 are time-space distribution graphs; they reflect a topic's current
situation from the point of view of time and space. In other words, we add a time axis
to the map, so that we can get a topic's spatial distribution at a certain point in
time. The red dots in the graph indicate the occurrence places of news stories, and
their sizes indicate the importance of the topic. By sliding the scroll bar below the
map, we can choose to browse the time-space distribution of a day or a period of time.
In Fig. 5 we choose April 12th, and in Fig. 6 we choose April 24th. Comparing these two
graphs, the red dot located in Beijing is bigger in Fig. 6 than in Fig. 5, while the
opposite holds for the red dot located in GuangDong. This indicates that from April
12th to April 24th, the news reported about SARS decreased in GuangDong and increased
in Beijing. We can then deduce that the situation of SARS in GuangDong was under
control, while the situation in Beijing was worsening, which accords with the facts.
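
The data behind such a comparison can be sketched by summing story importances per day and place and then comparing two dates. The function names and all importance values below are made up for illustration; only the direction of change mirrors the observation above.

```python
from collections import defaultdict


def importance_by_day_and_place(stories):
    """Sum story importances per (day, place); each sum sets one dot's size."""
    totals = defaultdict(float)
    for day, place, importance in stories:
        totals[(day, place)] += importance
    return dict(totals)


def trend(totals, place, first_day, last_day):
    """Change in a place's topic importance between two days."""
    return totals.get((last_day, place), 0.0) - totals.get((first_day, place), 0.0)


# Made-up story importances for two days and two places.
stories = [("04-12", "Beijing", 0.5), ("04-12", "GuangDong", 0.9),
           ("04-12", "GuangDong", 0.8), ("04-24", "Beijing", 0.9),
           ("04-24", "Beijing", 0.8), ("04-24", "Beijing", 0.7),
           ("04-24", "GuangDong", 0.4)]
totals = importance_by_day_and_place(stories)
```

A positive trend corresponds to a growing dot on the map and a negative trend to a shrinking one.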




Fig. 5. Time-space distribution graph of SARS on April the 12th



Fig. 6. Time-space distribution graph of SARS on April the 24th



By adopting the methods proposed above, we can finally conclude that in the period
from April 12th to April 24th, SARS was distributed mainly in Beijing and GuangDong.
Of the two, SARS in Beijing was more serious, while SARS in GuangDong was under
control. According to these conclusions, decision-makers can adjust their investments
and policies correspondingly to avoid losses or gain more profit.



5 Conclusions

In this paper, a news video mining method based on statistical analysis and
visualization is proposed. According to the features of news video, the method analyzes
news topics in terms of two factors, time and space, discovers interesting patterns in
the news, and provides two visualization methods: the time-tendency graph and the
time-space distribution graph. Our preliminary experiments show that this method is
helpful for decision-making. News video mining is a representative and promising
research area in the field of multimedia mining. The framework and methods of news
video mining are still in their infancy. We believe that news video mining will have
great influence on various fields, such as information analysis, strategic
decision-making and enterprise planning.



Abstract. We are building a broadcast news video archive where topics
of interest can be retrieved and tracked easily. This paper introduces a
structuring method applied to the accumulated news videos. First they
are segmented into topic units and then threaded according to their mutual
relations. A user interface for topic thread-based news video retrieval
is also introduced. Since the topic thread structure is formed so that it
has fewer links emerging from each topic than a simple link structure of
related topics, it should lessen the tedious selection a user faces during
a tracking process. Although the evaluation of the effect of threading and
a user study on the interface are yet to be done, we have found the
interface informative for understanding the details of a topic of interest.



1 Introduction 

Broadcast video, especially news video, contains a broad range of human activities
which could be considered as a valuable cultural and social heritage. We are 
building a broadcast news video archive where topics of interest can be retrieved 
and tracked easily. The archive is supported by an automatic archiving system, a 
back-end contents analysis process, and a front-end user interface. In this paper, 
we will mainly focus on introducing the back-end contents analysis process, where 
the accumulated news videos are segmented into topic units and then threaded 
according to their mutual relations, and the front-end user interface. 

The automatic archiving system captures and records broadcast news video 
streams including closed-caption texts (transcripts of audio), while the meta 
data are stored in a relational database. Currently, we have approximately 495 
hours (312 GB of MPEG-1 and 1.89 TB of MPEG-2 format videos, and 23.0 MB 
of closed-caption texts) in the archive, obtained from a specific Japanese daily 
news program since March 16, 2001 (1,034 days in total). Each night, after the 
day’s program is added to the database, the back-end contents analysis process 
will run. The process will be finished by the next morning so that a user can 
browse through the archive that reflects the topics added the previous night. 

Fig. 1. Part of a topic thread structure extracted from the archive. Topics are labeled 
in the following format: “Year/Month/Day-Topic#”. 



Topic thread structure in a news video archive. A news video archive 
may seem merely an accumulation of video files recorded every day. The majority of 
previous work on news video structure analysis concentrated on segmenting a 
video stream into semantic units such as topics for retrieval (the most recent 
being Yang et al. 2003). However, such retrieval is efficient only while the size 
of the archive remains relatively small. Once the archive grows larger, even the 
selection among the retrieved units becomes tedious for a user. Although several 
groups are dealing with news video archives of a size comparable to ours 
(Merlino et al. 1997; Christel et al. 1999), they do not look into the semantic 
relations between chains of topics. Work in Web structure mining is somewhat 
related to our work, but the existence of the chronological restriction makes our 
target substantially different. 

We consider that linking semantically related topics in chronological order 
(threading) should be a solution to overcome this problem by providing a user 
with a list of topic threads instead of a whole list of individual topics. Once the 
topic thread structure of the entire archive has been revealed, it will no longer 
be a mere accumulation of video files, but a video archive where the threads 
complexly interweave related topics. Figure 1 shows an example of a topic thread 
structure starting from a topic of interest, which was actually obtained from the 
archive by the method proposed in this paper. As seen in the example, topic 
threads merge and diverge over time, reflecting the transition of a story in the 
real world. Compare Fig. 1 with Fig. 2, which shows a simple link structure of 
related topics sorted chronologically. When providing a topic tracking interface 
by showing topics linked from a topic of interest, the fewer links there are 
from each node, the less tedious the selection should be for a user. 

In the following Sections, the topic segmentation and threading methods are 
introduced, followed by the introduction of a thread-based browsing interface. 

2 Topic Structuring 

2.1 Topic Segmentation 

A news topic is a semantic segment within a news video which contains a report 
on a specific incident. Compared with topic segmentation in general documents, 
topic boundaries in a news video should be relatively clear, since a news 
video is naturally a combination of originally individual topics. Following this 
assumption, we detect a topic boundary by finding a point between sentences 
where the keyword distributions within the preceding and succeeding windows are 
distinctly different. The ideal window sizes are those that exactly match the 
actual topic lengths on both sides. Since the actual topic length is 
unknown beforehand, we set elastic windows at each point between sentences 
to evaluate the discontinuity at various window sizes. 

Fig. 2. Example of a simple topic link structure without threading. 



Procedure. The following steps were taken to detect a topic boundary from a 
closed-caption text broadcast simultaneously with the video. The closed-caption 
currently used is basically a transcript of the audio speech, though occasionally 
it is overridden by a source script or omitted when a superimposed caption is 
inserted in the video. 

1. Apply morphological analysis to each sentence of a closed-caption text to 
extract compound nouns. The Japanese morphological analysis software JUMAN 
(Kyoto Univ. 1999) was employed. Compound nouns were extracted 
since a combination of adjacent nouns was considered more distinctive in 
representing a topic than a group of individual nouns. 

2. Apply semantic analysis to the compound nouns to generate a keyword frequency 
vector for each semantic class (general, personal, locational / organizational, 
or temporal) per sentence (k_g, k_p, k_l, k_t), which has frequencies as 
values. A suffix-based method (Ide et al. 2003) was employed for the analysis, 
which classifies compound nouns both with and without proper nouns, 
according to suffix dictionaries for each semantic class. 

3. At each boundary point between sentences i and i + 1, set a window size 
w, and measure the difference of keyword distributions between the w preceding 
and the w succeeding sentences. The difference (or rather resemblance) is defined 
as follows, where i = w, w + 1, ..., i_max - w, i_max is the number of 
sentences in a daily closed-caption text, and S = {g, p, l, t}. 

R_{s,w}(i) = \frac{ \sum_{m=i-w+1}^{i} k_s(m) \cdot \sum_{n=i+1}^{i+w} k_s(n) }{ \left| \sum_{m=i-w+1}^{i} k_s(m) \right| \left| \sum_{n=i+1}^{i+w} k_s(n) \right| }, \quad s \in S    (1), (2) 

We set w = 1, 2, ..., 10 in the following experiment. 

4. The maximum of R_{s,w}(i) among all w is chosen at each boundary as follows. 

R_s(i) = \max_{w} R_{s,w}(i)    (3) 

In preliminary observations, although most boundaries were correctly detected 
regardless of w, there were many over-segmentations. We 
considered that taking the maximum should mutually compensate for 
over-segmentations at various window sizes, due to the following tendencies. 

— Small w: Causes numerous over-segmentations, but has the advantage of 
showing significantly high resemblance within a short topic. 

— Large w: Does not always show high similarity within a short topic, but 
shows relatively high resemblance within a long one. 

5. Resemblances evaluated in separate semantic attributes are combined as a 
weighted sum as follows. 

R(i) = \sum_{s \in S} a_s R_s(i)    (4) 

Different weights are assigned to each semantic class under the assumption 
that certain attributes should be more important than others when considering 
topic segmentation, especially in news texts. 

Multiple linear regression analysis was applied to manually segmented training 
data (consisting of 39 daily closed-caption texts, with 384 manually given 
topic boundaries) to determine the weights. The following weights were obtained 
as a result. 

(a_g, a_p, a_l, a_t) = (0.23, 0.21, 0.48, 0.08)    (5) 

The weights show that temporal nouns (e.g., today, last month) are not distinctive 
in the sense of representing a topic, whereas the other three, especially 
locational / organizational nouns, act as distinctive keywords. 

Finally, if R(i) does not exceed a certain threshold θ_seg, the point is judged 
as a topic boundary. 

6. To concatenate over-segmented topics, create a keyword vector K_s for each 
topic, and re-evaluate the resemblances between adjoining stories i and j 
(= i + 1) by the following function. 

R(i,j) = \sum_{s \in S} a_s \frac{ K_s(i) \cdot K_s(j) }{ |K_s(i)| \, |K_s(j)| }    (6) 

As for a_s, the same weights as in Equation 5 were used. 

If R(i,j) does not exceed a certain threshold θ_cat, the topics are concatenated. 
This process continues until no more concatenation occurs. 
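
The boundary detection in Steps 3 to 5 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: each sentence is assumed to be a dictionary mapping a semantic class ("g", "p", "l", "t") to a keyword-frequency dictionary, the function names are hypothetical, and for a boundary near the start or end of the text only the window sizes that fit are tried (the paper instead restricts i to w, ..., i_max - w).

```python
import math
from collections import Counter

CLASSES = ("g", "p", "l", "t")
# Weights from Equation 5 (general, personal, locational/organizational, temporal).
WEIGHTS = {"g": 0.23, "p": 0.21, "l": 0.48, "t": 0.08}


def cosine(a, b):
    """Cosine of the angle between two sparse keyword-frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def window_sum(sentences, s, start, end):
    """Sum the class-s keyword vectors of sentences[start:end]."""
    total = Counter()
    for sentence in sentences[start:end]:
        total.update(sentence.get(s, {}))
    return total


def resemblance(sentences, i, max_w=10):
    """R(i) for the point with i sentences before it (Equations 1, 3 and 4)."""
    n = len(sentences)
    r = 0.0
    for s in CLASSES:
        best = 0.0
        for w in range(1, max_w + 1):
            if i - w < 0 or i + w > n:
                break  # assumption: skip window sizes that do not fit
            before = window_sum(sentences, s, i - w, i)
            after = window_sum(sentences, s, i, i + w)
            best = max(best, cosine(before, after))  # maximum over w
        r += WEIGHTS[s] * best  # weighted sum over semantic classes
    return r


def detect_boundaries(sentences, theta_seg=0.28, max_w=10):
    """Points whose resemblance does not exceed the segmentation threshold."""
    return [i for i in range(1, len(sentences))
            if resemblance(sentences, i, max_w) <= theta_seg]
```

A boundary index i returned by `detect_boundaries` lies between the i-th and (i + 1)-th sentences. The concatenation of Step 6 is not included in this sketch.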



Experiment and evaluation. The procedure was applied first to the training 
data used in Step 5 to define the thresholds (θ_seg = 0.28, θ_cat = 0.08), and 
later to the entire archive ranging from March 16, 2001 to April 9, 2004 (1,034 
days with 132,581 sentences in total). The whole process takes approximately 
5 seconds per day on a Sun Blade-1000 workstation with dual UltraSPARC-III 
750 MHz CPUs and 2 GB of main memory. As a result, 13,211 topics with 
more than two sentences were extracted. Topics with only one sentence (25,403 
topics) were excluded since they tend to be noisy fragments resulting from 
over-segmentation. 



Table 1. Evaluation of topic extraction. 

Condition    Both ends strict    One end strict / loose    Both ends loose 
Recall       30.0%               34.6%                     95.4% 
Precision    28.5%               32.8%                     90.5% 



Evaluation was performed by applying the same procedure to manually segmented 
test data (consisting of 14 daily closed-caption texts with 130 topics, set 
aside from the training data), with the results shown in Table 1. Boundaries were 
counted as correct if they matched the manually determined ones exactly 
in the ‘strict’ condition, and within ±1 sentence in the ‘loose’ condition. The ‘loose’ 
condition is acceptable for our goal since sentences at true boundaries tend to 
be short and relatively less informative regarding the main news content. 
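
As an illustration of the three conditions in Table 1, the sketch below scores extracted topics against manually segmented ones. It is only one plausible reading of the evaluation: topics are assumed to be (first sentence, last sentence) index pairs, an extracted topic is assumed to be correct if some reference topic matches it at both ends under the chosen condition, and the names `topic_matches` and `evaluate` are hypothetical.

```python
def topic_matches(found, truth, condition):
    """Compare two (start, end) sentence-index pairs under one condition."""
    start_diff = abs(found[0] - truth[0])
    end_diff = abs(found[1] - truth[1])
    if condition == "strict":   # both ends must match exactly
        return start_diff == 0 and end_diff == 0
    if condition == "mixed":    # one end exact, the other within one sentence
        return ((start_diff == 0 and end_diff <= 1)
                or (start_diff <= 1 and end_diff == 0))
    if condition == "loose":    # both ends within one sentence
        return start_diff <= 1 and end_diff <= 1
    raise ValueError(f"unknown condition: {condition}")


def evaluate(found_topics, true_topics, condition):
    """Return (recall, precision) of the extracted topics."""
    recalled = sum(
        any(topic_matches(f, t, condition) for f in found_topics)
        for t in true_topics)
    correct = sum(
        any(topic_matches(f, t, condition) for t in true_topics)
        for f in found_topics)
    recall = recalled / len(true_topics) if true_topics else 0.0
    precision = correct / len(found_topics) if found_topics else 0.0
    return recall, precision
```

With this reading, a topic whose end is off by one sentence counts as an error under the strict condition but as a hit under the other two, which is consistent with the large gap between the strict and loose figures in the table.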

2.2 Topic Threading 

A topic thread structure starting from a topic of interest is formed so that related 
topics are linked in chronological order. The difference from simply expanding 
related topics node by node as in Fig. 2 is that the proposed method replaces 
a link with one to a subordinate node where possible. The structure will therefore 
be rather flat: few branches at each node, and long sequences of related topics instead. 
By this method, a topic that would eventually be reached in the tracking process 
is pushed down, so that a user need not select among dozens of topics that 
he/she would never even need to see. 

Fig. 3. Topic threading scheme. The scheme applies if an identical sub-tree exists in 
(Case 1) the descendants of its siblings, or (Case 2) the descendants of its ancestors 
(except its parent). 



Procedure. The thread structure is formed by the following algorithm. 

1. Expand a topic link tree starting from the topic of interest so that it satisfies 
the following conditions. 

a) Children are topics related to a parent, under the condition that their 
time stamps always succeed their parent’s chronologically. 

b) Siblings are sorted so that their time stamps always succeed their left- 
siblings’ chronologically. 

The resemblance between topics is evaluated by Equation 6. When R(i,j) 
exceeds a threshold θ_trk, the topics are considered related. This procedure 
forms a simple topic link tree such as the structure in Fig. 2. 

Since evaluating numerous resemblances between various topics consumes 
too much time for real time processing in the user interface, resemblances 
between all possible topic pairs are evaluated beforehand. Currently, it takes 
roughly 1,400 seconds to add one new topic (comparing one topic against 
approximately 12,000 topics), which will keep on increasing as the archive 
grows larger. 

2. For each sub-tree T_s(i), if an identical sub-tree T_s(j) exists on the left side, 
perform either of the following operations. 

a) Remove T_s(i) if T_s(j) is a descendant of T_s(i)’s sibling. 

b) Else, merge T_s(i) with T_s(j) if T_s(j) is a descendant of T_s(i)’s ancestor 
except its parent. 

The sub-tree is removed in (a) instead of being merged, to avoid creating a 
shortcut link without specific meaning. The removal and merger scheme is shown 
in Fig. 3. As a result of this operation, the thread structure will form a 
chronologically ordered directed graph. 

Note that this is in the case of forming a succeeding thread structure. A chrono- 
logically opposite algorithm is applied when forming a preceding thread struc- 
ture. 

To reduce computation time, the following conditions are applied in practice. 

1. Pruning: Perform Step 2 whenever an identical story is found during the 
expansion of the tree in Step 1. 

2. Approximation: Interrupt the expansion of the tree at a certain depth N_trk. 

Although Condition 2 approximates the result, this will not affect much when 
referring to the direct children of the root (topic of interest) if N_trk is set to an 
appropriate value (we found N_trk = 3 to 5 sufficient in most cases). 
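
The expansion of the link tree (Step 1) together with the depth limit (Condition 2) can be sketched as below. This is a simplified illustration under assumptions: the pairwise resemblances are taken from a precomputed dictionary keyed by unordered topic pairs, the tree is represented as nested dictionaries, and the removal and merging of identical sub-trees (Step 2) is deliberately left out. The names `expand_link_tree` and `count_links` are hypothetical.

```python
def expand_link_tree(topic, timestamps, scores, theta_trk, max_depth):
    """Expand a chronological topic link tree rooted at `topic`.

    timestamps: dict mapping a topic ID to a sortable time stamp
    scores:     dict mapping frozenset({a, b}) to a precomputed resemblance
    theta_trk:  two topics are related if their resemblance exceeds this
    max_depth:  expansion is interrupted at this depth
    """
    node = {"topic": topic, "children": []}
    if max_depth == 0:
        return node
    # Children are related topics whose time stamps succeed the parent's.
    later_related = [
        other for other in timestamps
        if timestamps[other] > timestamps[topic]
        and scores.get(frozenset((topic, other)), 0.0) > theta_trk
    ]
    # Siblings are sorted chronologically from left to right.
    for child in sorted(later_related, key=lambda t: timestamps[t]):
        node["children"].append(
            expand_link_tree(child, timestamps, scores, theta_trk,
                             max_depth - 1))
    return node


def count_links(node):
    """Total number of parent-child links in the expanded tree."""
    return len(node["children"]) + sum(count_links(c) for c in node["children"])
```

Even for three mutually related topics A, B and C in chronological order, the tree from A contains C twice (once under A and once under B); duplicates of this kind are what the removal and merging in Step 2 are meant to eliminate.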



3 Topic Thread-Based Video Retrieval 

We built a topic retrieval interface, namely the “Topic Browser”, so that a user 
can browse through the entire news video archive by tracking up and down the 
topic threads. The interface consists of a “Topic Finder” and a “Topic Tracker”, 
which can be switched by tabs. 



The “Topic Finder”. The “Topic Finder” is the portal to the tracking process; 
it retrieves a list of topics that contain the query term (Figure 4). A topic is 
represented by its meta data (date, time, topic ID), a thumbnail image (the first 
frame of the associated video), and an excerpt of the associated closed-caption 
text. The user can select the initial topic for tracking among the listed topics 
by actually viewing the video and associated closed-caption text displayed on the 
right side of the interface. 



The “Topic Tracker”. Once the user selects an initial topic, he/she will choose 
the “Topic Tracker” tab. To provide a list of topic threads starting from the 
initial topic, it performs the threading process as described in Sect. 2.2 on the 
fly (Figure 5). A topic thread is represented by the first topic in it, and by key 
phrases that represent it so that the user can distinguish it from 
other threads. The first topics are selected to represent the threads since they 
were judged not to resemble each other, and are thus considered as 
nodes where the topics diverge. The key phrases are noun sequences selected 
exclusively from the representative topic so that they do not overlap with those 
in other threads. The interface allows the user to set θ_trk and N_trk to adjust the 
number of threads to be displayed and the computation time. 

The user will keep on selecting topics until he/she understands 
the details of the story during the tracking process, or finally finds a certain topic. 
The tracking direction is switchable so that it can go back and forth in time. 

While “Topic Tracking” is in general a part of the “Topic Detection and 
Tracking (TDT) task” (Wayne 2000) in the TREC Conference, their definition 
of tracking and detection is somewhat static compared to what we are trying to 
realize in this interface. The point is that the proposed topic threading method 
extracts various paths along which a user may gradually track topics of interest, 
which requires our tracking to be more dynamic. 

Fig. 4. The “Topic Finder” interface. Result of a query “Bin Laden”. 

Fig. 5. The “Topic Tracker” interface. 

4 Conclusions 

We have proposed a method to reveal a topic-thread structure within a large- 
scale news video archive. The thread structure was applied to a video retrieval 
interface so that a user can track topics of interest along time. Since even the 
most relevant topic-based video retrieval method (Smeaton et al. 2003) considers 
the topic structure as simple links of related topics, the proposed approach is 
unique. Although precise evaluations and user studies are yet to be done, we have 
found the interface informative to understand the details of a topic of interest. 

We will further aim at integrating image-based relations employing such 
methods as described in (Yamagishi et al. 2003) to link video segments that 
are related by semantics that could not be obtained from text. Precision of 
the tracking might be improved by refining the relation evaluation scheme by 
comparing a topic to a group of topics in a thread. A user study will also be 
performed to improve the retrieval interface after introducing relevance feedback 
in the tracking process, refining the keyword/thumbnail selection scheme, and so 
on. Evaluation of the method on the TDT (Wayne 2000) corpus is an important 
issue, though the system will have to be adapted to non-Japanese transcripts. 

Abstract. Text retrieval from broadcast news video is unsatisfactory, 
because a transcript word frequently does not directly ‘describe’ the shot 
in which it was spoken. Extending the retrieved region to a window around 
the matching keyword provides better recall, but low precision. We im- 
prove on text retrieval using the following approach: First we segment 
the visual stream into coherent story-like units, using a set of visual news 
story delimiters. After filtering out clearly irrelevant classes of shots, we 
are still left with an ambiguity of how words in the transcript relate 
to the visual content in the remaining shots of the story. Using a lim- 
ited set of visual features at different semantic levels ranging from color 
histograms, to faces, cars, and outdoors, an association matrix captures 
the correlation of these visual features to specific transcript words. This 
matrix is then refined using an EM approach. Preliminary results show 
that this approach has the potential to significantly improve retrieval 
performance from text queries. 



1 Introduction and Overview 

Searching video is very difficult, but people understand how to search text 
documents. However, a text-based search on news videos is frequently error-prone 
for several reasons. If we only look at the shots where a keyword was spoken 
in a broadcast news transcript, we find that the anchor/reporter might be 
introducing a story, with the following shots being relevant, but not the current one. 
A speech recognition error may cause a query word to be mis-recognized when 
it was initially spoken during a relevant shot, but correctly recognized as the 
anchor wraps up the news story, leaving the relevant shot earlier in the sequence. 
Expanding a window of shots around the time of a relevant transcript word may 
boost recall, but is likely to also add many shots that are not relevant, thereby 
decreasing precision. Simple word ambiguities may also result in the retrieval of 
irrelevant video clips (e.g., is Powell Colin Powell [secretary of state], Michael 
Powell [FCC chairman], or the lake?). 

In this paper we lay out a strategy for improving retrieval of relevant video 
material when only text queries are available. Our first step segments the video 
into visually structured story units. We classify video shots as anchors [7], 
commercials [5], graphics or studio settings, or ‘other’ and use the broadcast video 
editor’s sequencing of these shot classes as delimiters to separate the stories. 
Separate classifiers were built to detect other studio settings and shots contain- 
ing logos and graphics using color features. In the absence of labeled data, these 
latter two classifiers were built interactively. Using color features all shots were 
clustered and presented to the user in a layout based on a multi-dimensional scal- 
ing method. One representative is chosen from each cluster. Cluster variance is 
mapped into the image size to convey confidence in a particular cluster. Clusters 
with low variance are manually inspected and clusters corresponding to classes 
of interest are selected and labeled. Story boundaries are postulated between the 
classified delimiters of commercial/graphics/anchor/studio shots. Commercials 
and graphics shots are removed. Anchor and studio/reporter images are also 
deleted but the text corresponding to them is still used to find relevant stories. 

The final step associates the text and visual features. On the visual side, we 
again create color clusters of the remaining (non-delimiter) shots. In addition we 
also take advantage of the results from existing outdoor, building, road and car 
classifiers. Finally, face detection results are also incorporated, grouping shots 
into ones with single faces, two faces, and three or more faces. On the text side, 
the words in the vocabulary are pruned to remove stop words and low frequency 
words. Then, co-occurrences are found by counting the associations of all words 
and all visual tokens inside the stories. The co-occurrences are weighted by the 
TF-IDF formula to normalize for rare or frequent words. Then a method based 
on Expectation Maximization is used to obtain the final association table. 
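
The co-occurrence counting described above can be illustrated with a short sketch. It covers only the raw counting step: the TF-IDF weighting and the Expectation Maximization refinement are not reproduced here, and the story layout (a dictionary with "words" and "visual_tokens" lists) and the function name are assumptions made for illustration.

```python
from collections import Counter


def count_cooccurrences(stories):
    """Count how often each word token appears in the same story as each
    visual token, over all stories.

    Each story is assumed to be a dict with two lists: "words" (pruned
    transcript words) and "visual_tokens" (one or more tokens per shot).
    """
    counts = Counter()
    for story in stories:
        for word in story["words"]:
            for token in story["visual_tokens"]:
                counts[(word, token)] += 1
    return counts
```

Because every word occurrence is paired with every visual token of the same story, long stories contribute many weak pairings; this is the ambiguity that the subsequent weighting and refinement steps are intended to reduce.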

We perform standard text retrieval on the query text, but instead of expand- 
ing a global window around the location of a relevant word, the story segmen- 
tation limits the temporal region in which shots relevant to the keyword may 
appear. All shots within a story containing a relevant query word are then re- 
ranked based on the visual features strongly correlated to this word based on 
the association table. This results in clear retrieval improvement over simplistic 
associations between a text word and the shot where it occurred. This approach 
also can help to find related words for a given story segment, or for suggesting 
words to be used with a classifier. 

Our experiments were carried out on the CNN news data set of the 2003 
TREC Video Track [11]. It contained 16650 shots as defined by a common shot 
segmentation, with one key-frame extracted for each shot. 

2 Segmenting Broadcast News Stories Using Classes of 
Visual Delimiters 

Our approach to segmentation relies on the recognition of visual delimiters in- 
serted by the editors of the broadcasts. Specifically, we identify four types of 
delimiters: commercials, anchors, studio settings and graphics shots. While we 
had a large amount of training data for the first two delimiter types, we inter- 
actively built classifiers for the studio settings and graphics shots using a novel 
approach. 

Table 1. Top: Number of shots detected and removed from each category. The remaining 
number of shots is 9497. Bottom: Number of correctly classified shots. 

              anchors      commercials    graphics      in-studio 
# elements    909          4347           1404          525 
# correct     818 (90%)    4304 (99%)     1303 (93%)    456 (87%) 


2.1 Commercials, Anchors, Studio, and Graphics Shots 

In news video broadcasts, commercials are often inserted between news stories. 
For efficient retrieval and browsing of news, their removal is essential, as com- 
mercials don’t contain any news material. To detect commercials, we use the 
approach in [5], which combines analysis of the repetitive use of commercials 
over time with their distinctive color and audio features. 

When searching for interesting broadcast news video material, detection and 
removal of anchor and reporter shots is also important. We use the method 
proposed in [7] to detect anchorpersons. 

Even after the removal of commercial and anchor shots, the remaining video 
still contains more than the pure news story footage. Virtually all networks use 
a variety of easily identifiable graphics and logos to separate one story from 
another, or as a starting shot for a particular news category (for example char- 
acteristic logos appear before sports, weather, health or financial news) or as 
corporate self-identification such as the “CNN headline news” logo. While the- 
oretically possible, it is quite tedious to manually label each of these graphics 
and build individual classifiers for them. However, these graphics appear fre- 
quently and usually have very distinctive color features and therefore can easily 
be distinguished. 

In order to detect and remove these graphics, we developed a method which 
first clusters the non-anchor, non-commercial shots using color features, and then 
provides an easy way to select clusters corresponding to these graphics. Similarly, 
shots that include studio settings with various anchors or reporters (apart from 
the straight anchorperson shots) can also be clustered, selected and removed to 
leave only shots of real news footage. 

There are many ways to cluster feature sets, with differing quality and accuracy. 
Applying a K-means algorithm on feature vectors is a common method. 
However, the choice of K is by no means obvious. In this study, we use the idea 
of G-means [6] to determine the number of clusters adaptively. G-means clusters 
the data set starting from a small number of clusters, C, and increases C iteratively 
if some of the current clusters fail the Gaussianity test (e.g., the Kolmogorov-Smirnov 
test). In our study, 230 clusters were found using color-based features. The 
specific color features were the mean and variance of each color channel in HSV 
color space in a 5×5 image tessellation. Hue was quantized into 16 bins. Both 
saturation and value were quantized into 6 bins. 
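
To make the color features concrete, the following sketch computes the mean and variance of each HSV channel over a 5×5 tessellation of a key-frame. It is an illustration under assumptions, not the authors' code: the image is assumed to be a list of rows of (r, g, b) tuples in the range 0 to 255 and at least 5 pixels in each dimension, hue is averaged as a plain number although it is a circular quantity, and the quantization into bins mentioned in the text is not applied. The function name is hypothetical.

```python
import colorsys


def hsv_grid_features(pixels, grid=5):
    """Mean and variance of H, S and V in each cell of a grid x grid
    tessellation; returns grid * grid * 3 * 2 values (150 for grid=5)."""
    height, width = len(pixels), len(pixels[0])
    features = []
    for gy in range(grid):
        y0, y1 = gy * height // grid, (gy + 1) * height // grid
        for gx in range(grid):
            x0, x1 = gx * width // grid, (gx + 1) * width // grid
            # Convert every pixel of the cell to HSV (each channel in 0..1).
            cell = [
                colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
                for row in pixels[y0:y1]
                for (r, g, b) in row[x0:x1]
            ]
            for channel in range(3):
                values = [p[channel] for p in cell]
                mean = sum(values) / len(values)
                variance = sum((v - mean) ** 2 for v in values) / len(values)
                features += [mean, variance]
    return features
```

Feature vectors of this kind, one per key-frame, are what a clustering method such as K-means or G-means would then group.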

From each cluster a representative image is selected, which was simply the 
element closest to the mean, and these representatives are displayed using a 
Multi Dimensional Scaling (MDS) method. The layout is shown in Figure 1-a. 

Panel (c), variance values for the selected clusters: 

            median     mean 
anchor      32.4304    32.7048 
graphics    35.2399    49.8366 
others      59.7816    64.7234 

Fig. 1. (a) Representative images for the clusters. Size is inversely related to the 
variance of the cluster. (b) Distribution of variances for all clusters. (c) Mean and median 
variance values for selected graphics clusters, anchor clusters and others. (d) Example 
graphics clusters selected and later removed. 




The size of the images is inversely related to the variance of the clusters. This 
presentation shows the confidence of the cluster. In-studio and graphics clusters 
tend to have less variance than other clusters, and therefore can be easily selected 
for further visual inspection. This process allows very quick review of the whole 
data set to label and remove the in-studio and graphics images. Table 1 shows 
the accuracy of our anchor, commercial, graphics and in-studio detection. Note 
that all detectors have an accuracy of 87% or higher, with commercials over 
99%. We now remove all detected anchor, commercial, graphics and in-studio 
shots from the collection to leave only the shots which are related to news story 
footage. From the original 16650 shots, only 9497 shots were left as news story 
shots. 



2.2 Segmenting with Visual Delimiters 

To segment video news based on the visual editing structure, we devised a heuris- 
tic that uses the detected anchor, commercial, graphics and in-studio shots as 
delimiters. The algorithm is as follows: 

— Start a new story after a graphics or commercial shot. 

— If there is a graphics or commercial in the next shot, then end the story. 

— Start a new story with an anchor shot which follows a non-anchor shot. 

— End a story with an anchor shot if the following is a non-anchor shot. 

Figure 2 shows example segments obtained with the proposed algorithm. 
Most of the stories are correctly divided. Graphics create hard boundaries, while 
the anchor/reporter shots (and their associated text transcripts) are included 
in both the preceding and following story segments. The reason is that an 
anchor usually finishes the previous story before starting another story. Without 
using textual information, the exact within-shot boundary cannot be accurately 
determined, nor can we tell if the anchor is only starting a story, but not finishing 
the previous one. Therefore, it is safer to add the anchor shots as part of both 
segments. This can be observed in Figure 2-d, where the iconic image in the 
right corner behind the anchor is the same at the beginning and end of a story 
segment. However, in other cases, two news stories having different icons are 
merged into one. This problem could be solved with a more careful analysis 
of the icons behind the anchors or through text analysis. Other problems arise 
due to misclassification of the delimiter images. This may cause one story to be 
divided into two, or one story to begin late or end early, as in Figure 2-c, and a 
delimiter may not be detected, as in Figure 2-e. These problems again could be 
handled with a textual segmentation analysis. 
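
The delimiter heuristic can be sketched as follows. This is one possible reading of the four rules, stated as assumptions: shots are given as a list of class labels, "graphics" and "commercial" shots act as hard boundaries and belong to no story, a run of "anchor" shots inside a stretch of footage is shared by the story it closes and the story it opens, and an anchor run at the very start or end of a stretch belongs only to the adjacent story. How in-studio shots are labelled is left to the caller, and the function names are hypothetical.

```python
HARD_DELIMITERS = {"graphics", "commercial"}


def split_chunk(labels, lo, hi):
    """Split the shots lo..hi-1 (no hard delimiters inside) at anchor runs."""
    stories = []
    start = lo
    k = lo
    while k < hi:
        starts_run = (labels[k] == "anchor" and k > lo
                      and labels[k - 1] != "anchor")
        if starts_run:
            run_end = k
            while run_end + 1 < hi and labels[run_end + 1] == "anchor":
                run_end += 1
            if run_end + 1 < hi:
                # The anchor run closes the current story and opens the next.
                stories.append((start, run_end))
                start = k
            k = run_end + 1
        else:
            k += 1
    stories.append((start, hi - 1))
    return stories


def segment_stories(labels):
    """Return stories as (first shot, last shot) index pairs, inclusive."""
    stories = []
    i, n = 0, len(labels)
    while i < n:
        if labels[i] in HARD_DELIMITERS:
            i += 1
            continue
        j = i
        while j < n and labels[j] not in HARD_DELIMITERS:
            j += 1
        stories.extend(split_chunk(labels, i, j))
        i = j
    return stories
```

For a sequence such as anchor, news, news, anchor, news, graphics, news, news, commercial, this yields the stories (0, 3), (3, 4) and (6, 7): the anchor shot at index 3 is part of both neighbouring stories, while the graphics shot separates stories without being shared.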

To estimate the accuracy of our segmentation, 5 news broadcasts (1052 shots) 
were manually annotated with ground-truth story segmentation. In the 5 broadcasts, 
114 story segments were found and 69 segments were detected correctly. The errors are 
mostly due to incorrect splits or incorrect merges. In our case, the goal of story 
segmentation is to separate the news into parts for a better textual-visual 
association. Therefore, these incorrect segmentations are actually not very harmful, 
and are sometimes even helpful. For example, dividing a long story into two parts can 
be better since words farther away are less related to the visual properties of 
the shots. 

Other approaches that have been proposed for story segmentation are usually 
text-based. Integrated multimedia approaches have been shown to work well, 
however they incur great development and training costs, as a large variety 
of image, audio and text categories must be labeled and trained [2]. Informal 
analysis showed that our simple and cost-effective segmentation is sufficient for 
typical video retrieval queries [7]. 



[Figure 2: panels (a) to (e); segment labels shown in the figure: [3-5], [5-8], [5-], [3-5], [6-]] 

Fig. 2. Example story segmentation results. (The segment boundaries are shown as 
[<starting shot> - <ending shot>]; an open end such as [5-] shows that the segment 
continues or starts outside of the selected shots.) 

3 Associating Semantics with Visual Classifiers 

The video is segmented into stories and the delimiters are removed from the 
video as described above. The data now consists of only the shots related to the 
news story and the transcript text related to that story. However, the specific 
image/text association is still unknown. Our goal is to associate the transcript 
words with the correct shots within a story segment for a better retrieval. 

The problem of finding the associations can be considered as the translation 
of visual features to words, similar to the translation of text from one language 
to another. In that sense, there is an analogy between learning a lexicon for 
machine translation and learning a correspondence model for associating words 
with visual features. 

In [3], association of image regions with keywords was proposed for region 
naming and auto-annotation using data sets consisting of annotated images. 
The images are segmented into regions and then quantized to get a discrete 
representation referred to as ‘blob tokens’. The problem is then transformed into 
translating blob tokens to word tokens. A probability table which links blob 
tokens with word tokens is constructed using an iterative algorithm based on 
Expectation Maximization [1] (for the details refer to [3]). A similar method is 
applied to link visual features with words in news videos [4], where the visual 
features (color, texture and edge features extracted from a grid) are associated 
with the neighboring words. However, this method has the problem of choosing the 
best window size for the neighborhood. In this study, we use the story segments 
as the basic units and associate the words and the visual features inside a story 
segment. Compared to [3], in our case, story segments take the place of images, 
and shots take the place of regions. In our study, visual features are also expanded 
with mid-level classifier outputs, which are called ‘visual tokens’. The vocabulary 
is also processed to obtain ‘word tokens’. The next section gives details on how 
to obtain these tokens. Then, we describe a method to obtain the association 
probabilities and how they can be used for better retrieval. 



3.1 Extracting Tokens 

We adapt some classifiers from Informedia’s TREC-VID 2003 submission [7]. 
Outdoor, building, car and road classifiers are used in the experiments. The outdoor 
and road classifiers are based on the color features explained in the previous 
section and on texture and edge features. Oriented energy filters are used as 
texture features and a Canny edge detector is used to extract edges. The classifier 
is a support vector machine with a degree-2 polynomial kernel 
function. Car detection is performed with a modified version of Schneiderman’s 
algorithm [10], trained on numerous examples of side views of cars. For 
buildings, we built a classifier by adapting the man-made structure detection 
method of Kumar and Hebert [8], which produces a binary detection output for 
each cell of a 22x16 grid. We extracted 4 features from the binary detection outputs: 
the area and the x and y coordinates of the center of mass of the bounding 
box that includes all the positive cells, and the ratio of the number of positive 
cells to the area of the bounding box. Examples with values larger than 
the thresholds are taken as building images. For faces, Schneiderman’s face 
detector [10] is used to extract frontal faces. Shots are grouped 
into 4 categories: no face (0), single face (1), two faces (2), three or more 
faces (3). Finally, color-based clusters are also used after removing the clusters 
corresponding to in-studio and graphics shots. After removing 53 graphics and 
17 in-studio clusters from the 230 color clusters, 160 clusters remain. 
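
The four building features described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `building_features`, the list-of-rows grid representation, and the choice to measure the box center in grid-cell units are all assumptions made for the example.

```python
def building_features(grid):
    """Compute (area, center_x, center_y, fill_ratio) from a binary grid.

    `grid` is a list of rows, each a list of 0/1 detection outputs
    (hypothetical representation; the paper uses a 22x16 grid).
    Returns None when no cell is positive.
    """
    positives = [(x, y)
                 for y, row in enumerate(grid)
                 for x, value in enumerate(row) if value]
    if not positives:
        return None
    xs = [x for x, _ in positives]
    ys = [y for _, y in positives]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    # Bounding box that encloses all positive cells, measured in cells.
    area = (x1 - x0 + 1) * (y1 - y0 + 1)
    # Center of the bounding box (assumed reading of "center of mass").
    center_x = (x0 + x1) / 2.0
    center_y = (y0 + y1) / 2.0
    # Fraction of the bounding box covered by positive cells.
    fill_ratio = len(positives) / area
    return area, center_x, center_y, fill_ratio
```

A shot would then be labelled as a building image when these values exceed thresholds; the thresholds themselves are not given in the text.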

These classifiers are error-prone. As shown in Table 2, removing the delimiters 
increases the accuracy of detections, but overall accuracy is still very low. Our goal 
is to understand how visual information, even if imperfect, can improve retrieval 
results. As will be discussed later, better visual information provides better 
text/image association; it is therefore desirable to improve the accuracy of the 
classifiers, and also to create classifiers which are more specific and therefore 
more coherent. 

On the text side, transcripts are aligned with shots by determining when each 
word was spoken. The vocabulary consists of nouns only. Because the 
transcripts obtained from speech recognition are error-prone, many incorrect words remain in 
the vocabulary. To remove stop words and infrequent words, we retained only words occurring more 
than 10 times and fewer than 150 times, which prunes the vocabulary 
from 10201 words to 579 words. 
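
The frequency-based pruning can be illustrated with a short sketch. The function name `prune_vocabulary` and the flat list of noun tokens are assumptions for the example; only the two frequency bounds (more than 10, fewer than 150 occurrences) come from the text.

```python
from collections import Counter


def prune_vocabulary(noun_tokens, min_count=10, max_count=150):
    """Keep words occurring more than min_count and fewer than max_count times."""
    counts = Counter(noun_tokens)
    return {word for word, count in counts.items()
            if min_count < count < max_count}
```

Both bounds are exclusive here, matching the "more than" and "fewer than" wording.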



Table 2. Classifier accuracies. Before: the original detection results on all the shots; 
after: the results after the removal of anchor, commercial and delimiter shots. Each entry shows the 
number of shots detected correctly over all the detected shots. For the outdoor classifier, due to the 
large number of images, only half of the data was truthed. Originally, the number of detected 
outdoor shots was 5776 after removing anchors, delimiters and commercials. 



classifier   outdoor             building          car             road 
before       1419 / 4179 (34%)   126 / 924 (14%)   26 / 78 (33%)   71 / 745 (9%) 
after        1000 / 2152 (46%)   101 / 456 (22%)   14 / 40 (35%)   40 / 421 (9%) 



3.2 Obtaining Association Probabilities 

Visual tokens are associated with word tokens using a “translation table”. The 
first step is finding the co-occurrences of visual tokens and words by counting the 
occurrences of words and visual tokens for each shot inside the story segments. 
The frequencies of visual tokens span a very large range, as do those of 
words. In order to normalize the weights we apply tf-idf, which was used successfully 
in [9] for region-word association. After building the co-occurrence table, 
the final “translation table” is obtained using the Expectation-Maximization 
algorithm as proposed in [3]. The final translation table is a probability table 
which links each visual token with each word. 
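
To make the EM step more concrete, the sketch below estimates a translation table in the style of a simple lexicon-learning model from machine translation. It is an assumption-laden illustration rather than the method of [3]: the function name `train_translation_table`, the input format (one pair of token lists per story segment), the uniform initialization and the fixed iteration count are all hypothetical, and the tf-idf weighting mentioned above is omitted for brevity.

```python
from collections import defaultdict


def train_translation_table(pairs, iterations=10):
    """Estimate table[v][w] ~ P(word w | visual token v) with EM.

    `pairs` is a list of (visual_tokens, word_tokens) tuples, one per
    story segment (hypothetical input format).
    """
    visual_vocab = {v for vs, _ in pairs for v in vs}
    word_vocab = {w for _, ws in pairs for w in ws}
    # Start from a uniform distribution over words for every visual token.
    table = {v: {w: 1.0 / len(word_vocab) for w in word_vocab}
             for v in visual_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: share each word among the visual tokens of its segment.
        for vs, ws in pairs:
            for w in ws:
                norm = sum(table[v][w] for v in vs)
                for v in vs:
                    share = table[v][w] / norm
                    count[v][w] += share
                    total[v] += share
        # M-step: renormalize the expected counts per visual token.
        for v in visual_vocab:
            for w in word_vocab:
                table[v][w] = count[v][w] / total[v] if total[v] > 0 else 0.0
    return table
```

With a toy corpus in which a "court" token mostly co-occurs with "game" and a "chart" token with "stock", the table concentrates probability on those pairs after a few iterations.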

Figure 3 shows the top 20 words with the highest probability for some selected 
visual tokens. These tokens were chosen for their high word association 
probabilities. We observe that when a cluster is coherent, the words associated 
with it are closely related. This association is especially clear for sports and financial news. 
The building and road classifiers perform relatively better than the outdoor 
and car classifiers. The reason is that there are not many examples of cars, and 
the outdoor classifier is related to very many words due to the large number of outdoor 
shots. 

The learned associations help to improve search. For a text-based 
search, first the story segments that include the query word are obtained. Then, instead 
of choosing the shot that is directly aligned with the query word, we 
choose the shot that has the highest probability of being associated with the 
query word. Figure 4 shows the search results for a standard time-based alignment 
and after the words are associated with the shots using the proposed approach. 
For this example we choose only one shot from each story segment. Using 
sample queries for ‘clinton’ and ‘fire’, we find that 27 of 133 shots include Clinton 
using the proposed method (20% accuracy), while only 20 of 130 shots include 
him when the shots aligned with the text in time are retrieved (15% accuracy). 
For the ‘fire’ query, the numbers are 15/38 (40%) for the proposed approach and 
11/44 (25%) for the time-based approach. 
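
The shot selection step can be sketched as below. This is a hedged illustration only: the name `best_shot_for_query`, the `(shot_id, visual_tokens)` input format, and in particular the rule of scoring a shot by the maximum probability over its visual tokens are assumptions; the text does not say how the probabilities of several tokens in one shot are combined.

```python
def best_shot_for_query(story_shots, query_word, table):
    """Return the id of the shot most strongly associated with query_word.

    `story_shots` is a list of (shot_id, visual_tokens) for one story
    segment; `table[v][w]` holds the association probability of word w
    given visual token v (both formats are hypothetical).
    """
    def score(tokens):
        # Assumed combination rule: take the strongest single association.
        return max((table.get(v, {}).get(query_word, 0.0) for v in tokens),
                   default=0.0)

    return max(story_shots, key=lambda shot: score(shot[1]))[0]
```

Calling this once per story segment that contains the query word yields one shot per segment, as in the example of Figure 4.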




stock, wall, market, street, investor, report, news, business, jones, industrials, 
interest, deal, thanks, cnnfn, company, susan, yesterday, morris, number, merger 

pilot, veteran, family, rescue, foot, effort, crew, search, security, troop, 
fact, affair, member, survivor, tobacco, field, department, health, communication, leader 

series, bull, jazz, playoff, game, conference, final, karl, lead, indiana, 
Utah, difference, combination, board, night, ball, point, pair, front, team 

company, market, line, worker, street, union, profit, wall, cost, news, 
strike, yesterday, rate, quarter, stock, check, report, level, fact, board 



Fig. 3. For three color tokens and for the building token, some selected images and the 
20 words associated with the highest probability. 

Fig. 4. Search results, left: for ‘clinton’, right: for ‘fire’. Top: using shot text; bottom: 
the proposed method. While the time-based alignment produces unrelated shots (e.g. 
anchors for clinton), the proposed system associates the words with the correct shots. 
4 Conclusion 

The association of transcripts with visual features extracted from the shots is proposed 
for better text-based retrieval. Story segmentation based on delimiters, 
namely anchor/commercial/studio/graphics shots, is presented for extracting the 
semantic boundaries to link the shots and text. It is shown that by removing the 
delimiters and finding the associations it is possible to find the shots which actually 
correspond to the words. This method can also be used to suggest words and 
to improve the accuracy of classifiers. As observed in preliminary experiments, 
better visual tokens result in better associations. Having more specific classifiers 
may provide more coherent tokens. In the future we are planning to extend this 
work to motion information, which can be associated with verbs. In this study 
only the extracted speech transcript was used. Text overlays can also be used 
for association. 
Abstract. This paper describes the evaluation of a new component-based 
approach to querying and retrieving for visualization and clustering from a large 
collection of digitised trademark images using the self-organizing map (SOM) 
neural network. The effectiveness of the growing hierarchical self-organizing 
map (GHSOM) has been compared with that of the conventional SOM, using a 
radial based precision-recall measure for different neighbourhood distances 
from the query. Good retrieval effectiveness was achieved when the GHSOM 
was allowed to grow multiple SOMs at different levels, with markedly reduced 
training times. 



1 Introduction 

The number and variety of image collections has risen rapidly over recent years which 
has led to increasing interest in automatic techniques for content based image retrieval 
(CBIR). Most CBIR techniques [1] compute similarity measures between stored and 
query images from automatically extracted features such as colour, texture or shape. 
Trademark image retrieval is currently an active research area [2], both because of 
the economic importance of trademarks and because they provide a good test-bed for shape 
retrieval techniques. Two main approaches to trademark retrieval can be 
distinguished: the first based on comparison of features extracted from images as a 
whole (e.g. [3]), the second based on matching of image components (e.g. [4]). The 
latter approach appears to be more successful than the former [5]. 

Our previous work [6, 7] investigated the effectiveness of 2-D self-organizing 
maps (SOM) as a basis for trademark image retrieval and visualization. The aim of 
this study is to further investigate how the new component-based matching 
framework [7] scales up to larger image and query collections, and whether 
improvements in training times or retrieval effectiveness can be achieved using an 
adaptive SOM. 



2 Self-Organizing Maps 

The SOM [8] was selected as the basis for our approach to retrieval and visualization 
because it is an unsupervised, topologically ordering neural network and a non-linear 
projector, able to order data locally and globally and capable of accepting new data 
items without extensive re-training. It projects high-dimensional data $x$ from the n-D 
input space onto a low-dimensional lattice (usually 2-D) of $N$ neighbourhood-connected 
(usually rectangular or hexagonal) nodes $\{r_i \mid i = 1, \ldots, N\}$, each with an n-D weight vector 
$w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]$. 

A limitation of the SOM is that it has a static architecture, which has to be defined 
a priori. To overcome this, several adaptive architectures have been proposed (e.g. 
[9], [10]). The Growing Hierarchical SOM (GHSOM) [11] is used in this study 
because of its regularly shaped growing structure, its growth adapted to the data set, and 
its rapid training times. It grows multiple layers, where each layer is composed of 
independently growing SOMs. The training parameter $\tau_1$ controls how the SOM 
grows horizontally (increasing in size) to allow it to represent the input space to a 
specific granularity, while the parameter $\tau_2$ controls the vertical expansion of a SOM node 
that represents input data too diverse for it, so that the data are represented by a new SOM at the next 
layer. This is done by training and growing each SOM independently, inserting rows 
and columns if node $i$, with mean quantization error 

$$\mathrm{mqe}_i = \frac{1}{|I_k|} \sum_{x_j \in I_k} \lVert w_i - x_j \rVert ,$$ 

has $\mathrm{mqe}_i > \tau_1 \, \mathrm{mqe}_k$, where the data subset $I_k$ is the input data set that best matches the parent 
node $k$, and $\mathrm{mqe}_k$ is the error of the parent node in the previous layer (for the Layer 1 SOM, the 
parent error $\mathrm{mqe}_0$ is the mean error from the average input data). Once the SOM stops 
growing, nodes are examined for vertical insertion of a new SOM in the next layer if 
node $i$ has $\mathrm{mqe}_i > \tau_2 \, \mathrm{mqe}_0$. 
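
The two growth tests can be written down in a few lines. The sketch below is illustrative only and is not taken from the GHSOM implementation used in the study: the function names are hypothetical, the distance is assumed to be Euclidean, and the comparison is applied per node exactly as worded in the paragraph above.

```python
import math


def mean_quantization_error(weight, inputs):
    """Mean Euclidean distance between a node's weight vector and its inputs."""
    return sum(math.dist(weight, x) for x in inputs) / len(inputs)


def should_grow_horizontally(node_mqe, parent_mqe, tau1):
    """Insert rows/columns while the node error exceeds tau1 times the parent error."""
    return node_mqe > tau1 * parent_mqe


def should_expand_vertically(node_mqe, mqe0, tau2):
    """Spawn a new SOM below a node whose error exceeds tau2 times the Layer 0 error."""
    return node_mqe > tau2 * mqe0
```

Smaller values of either threshold make the corresponding test easier to satisfy, which is why lowering them produces larger maps or more layers.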



3 Component-Based Matching Framework 

Images such as trademarks consist of a variable number of components. Hence the 
normal SOM training framework, based on a single fixed-length feature vector for 
each image, needs modification. Our component-based matching framework [7] treats 
each component as the basic unit for querying and retrieving, and uses these to train a 
topologically ordered component map. The map is subsequently queried to measure 
component similarities, which are combined to retrieve images, either as a 1-D 
ordered list or a 2-D cluster map. This two-stage process is defined below: 

I. Database Construction: 

Each test image $T^A$ and query image $Q$ is part of the image collection $I$. 

(1) Each image $T^A$ is segmented into n components, $T^A = \{t^A_1, \ldots, t^A_n\}$, where n 
is variable and dependent on the image. 

(2) A suitable set of L image feature measures $f_l$ is taken for each component 
$t^A_j$, creating a fixed-length component feature vector $t^A_j = \{f_1, \ldots, f_L\}$. 

(3) These fixed-length feature vectors $t^A_j$ are topologically ordered by training a 
SOM or GHSOM Component Map (CM) (SOM_CM, GHSOM_CM); similar 
components from different images should be close. 

II. Search and Display of Images: 

(4) The query image $Q$ is segmented and feature measures are taken as in steps (1) and 
(2), giving $Q = \{q_1 = t^Q_1, \ldots, q_m = t^Q_m\}$, with m fixed-length components. 

(5) Create a 1-D ordered list of similar images (see [7] for full details). 

(6) Create a 2-D similarity cluster map: 

(a) A Component Similarity Vector (CSV) is computed between each query 
component $q_i$ and the test components $t^A$ within neighbourhood radius r 
around each query component on the CM. 

(b) Use these vectors to train a 2-D CSV SOM (SOM_CSV) map. 

(c) Display the topologically ordered SOM_CSV map. 



The Component Similarity Vector (CSV) $C^A$ of step (6) has m similarity measures, 
corresponding to the query's m components on the Component Map (CM), for the test 
image $T^A$ components: 

$$S(q_i, t^A) = c^A_i = \begin{cases} \exp\left(-\lVert r_{q_i} - r_{t^A} \rVert^2 / 2\sigma^2\right), & t^A \in N_r \\ 0, & t^A \notin N_r \end{cases} \qquad (1)$$ 

$$C^A = \{S(q_1, t^A), S(q_2, t^A), \ldots, S(q_m, t^A)\}$$ (CSV with m components). 

Each query component $q_i$ is searched within a neighbourhood radius r on the CM for 
test image $T^A$ components. Each element $c^A_i$ is a measure of how similar the test 
image $T^A$ is to the query component $q_i$; the whole vector $C^A$ is therefore a measure 
of similarity to the whole query image. 
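
A rough sketch of how such a vector might be computed is given below. Several details are assumptions made for illustration and are not confirmed by the text: similarity is taken to fall off as a Gaussian of the lattice distance between best-matching nodes, the width `sigma` is a free parameter, and when a test image has several components inside a neighbourhood the largest similarity is kept. The function names are hypothetical.

```python
import math


def component_similarity(query_node, test_node, radius, sigma):
    """Similarity of two components from the lattice positions of their nodes."""
    distance = math.dist(query_node, test_node)
    if distance > radius:
        return 0.0  # outside the search neighbourhood
    return math.exp(-distance ** 2 / (2.0 * sigma ** 2))


def csv_vector(query_nodes, test_nodes, radius, sigma):
    """One similarity value per query component for a single test image."""
    return [max((component_similarity(q, t, radius, sigma) for t in test_nodes),
                default=0.0)
            for q in query_nodes]
```

A test image whose vector is all zeros has no component inside any query neighbourhood and would be discarded before the CSV SOM is trained.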

This framework was tested successfully on a set of 4 test queries on a collection of 
5268 trademark images [7], using fixed SOMs. 



4 Present Experiments 

Our previous work [7] established that good CSV SOM cluster maps can be generated 
from collections of up to 5000 images. Current experiments aim to answer questions 
about whether this framework will scale up to larger data sets and larger numbers of 
queries, and whether the use of GHSOMs has a beneficial effect on training times or 
retrieval effectiveness. 



4.1 Image Collection and Feature Extraction 

The images were all from a collection of 10745 abstract trademark images provided 
by the UK Trade Mark Registry, originally for the evaluation of the ARTISAN 
shape retrieval system [4]. Registry staff also assembled 24 query sets of similar 
images from this collection, which are used as independently established ground truth 
for retrieval. 

All test images $T^A$ were segmented into boundary components $t^A$ by the method 
of [12]. For each component $t^A$, 15 shape descriptors (which proved effective in 
comparative retrieval trials in that study) were extracted: (a) 4 "simple" shape 
features: relative area, aspect ratio, circularity and convexity; (b) 3 "natural" shape 
features proposed in [13]: triangularity, rectangularity and ellipticity; and (c) 8 
normalized Fourier descriptors. 

The result of segmenting the 10745 trademark images was a set of 85860 
component feature vectors, each with 15 elements. 



4.2 SOM and GHSOM Component Map Configuration 

These 85860 component vectors $t^A$ were used to train a set of SOM_CM and GHSOM_CM 
maps (with different levels of horizontal and vertical expansion). The fixed-size SOM_CM 
maps reported here were of size 150x120, 175x150 and 250x200 = 50000 nodes 
(SOM_CM250x200) to accommodate the 85860 components. The SOM_CM250x200, for example, 
had rough ordering parameters of α = 0.25, N_0 = 40 with 90,000 cycles 
and fine-tuning parameters of α = 0.002, N = 6 with 250,000 cycles during training, and took 198.4 
minutes, as summarized in Table 1. All these hexagonally connected maps were 
initialised in the same way. Larger maps were not used, as training times would have 
been excessive. 

Table 2 shows the configuration of several GHSOM_CM maps grown by the selection 
of parameters $\tau_1$ and $\tau_2$; all grew from an initial 2x2 Layer 1 SOM. 

4.3 CSV SOM Training 

With the trained SOM_CM and GHSOM_CM maps, an m-element CSV training vector $C^A$ (Eq. 1) was 
created for each test trademark image $T^A$, for the query Q with m components. These 
CSV vectors were next used to train a small 12x15 = 180 node CSV SOM (SOM_CSV12x15) 
map. A small map was used because many CSV vectors were discarded, as many 
images had no components within the m query neighbourhoods. Finally, for 
visualisation, trademark images were positioned on their SOM_CSV12x15 best matching 
unit. 



4.4 Retrieval Performance Evaluation 

As with our previous work, retrieval effectiveness has been measured in terms of 
precision, the proportion of retrieved items that are relevant, and recall, 
the proportion of relevant items that are retrieved. Like Koskela et al. [14], we have modified 
this approach by computing radial precision $P(r, Q)$ and recall $R(r, Q)$ for query $Q$ at 
neighbourhood radius $r$, which operates as the maximum cut-off $r = N_{co}$. The average at 
radius $r$ over all $N$ (in this study 24) query images is defined as 
$P_{avg}(r) = \sum_i P(r, i) / N$ and $R_{avg}(r) = \sum_i R(r, i) / N$ for $\forall i \in \{Q_1, \ldots, Q_N\}$, 
while the overall average for the whole map is defined as 
$P_{avg} = \sum_r P_{avg}(r) / L$ and $R_{avg} = \sum_r R_{avg}(r) / L$ for $\forall r \in \{0, \ldots, L-1\}$, out to 
radius $L-1$. 
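
The radial measures and their two levels of averaging can be sketched as follows. The function names and the input format (for each query, a list indexed by radius of the set of image ids found within that radius) are assumptions for illustration; empty result sets are given precision 0 here, which is a convention of this sketch rather than something stated in the text.

```python
def radial_precision_recall(retrieved_by_radius, relevant):
    """Return [(P(r), R(r)) for r = 0..L-1] for one query."""
    results = []
    for retrieved in retrieved_by_radius:
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        results.append((precision, recall))
    return results


def average_over_queries(per_query):
    """Average P(r) and R(r) over all N queries, radius by radius."""
    n = len(per_query)
    radii = len(per_query[0])
    return [(sum(q[r][0] for q in per_query) / n,
             sum(q[r][1] for q in per_query) / n)
            for r in range(radii)]


def overall_average(avg_by_radius):
    """Average the per-radius means over all L radii."""
    count = len(avg_by_radius)
    return (sum(p for p, _ in avg_by_radius) / count,
            sum(r for _, r in avg_by_radius) / count)
```

For two toy queries evaluated at radii 0 and 1, `overall_average(average_over_queries([...]))` returns a single precision-recall pair of the kind reported per map in the tables.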



5 Visual Analysis of Results from GHSOM Component Map 

Fig. 1 gives an example of a GHSOM_CM trained using the 85860 components. The top-level 
(Level 1) GHSOM_CM map has 40 nodes, all of which have been expanded into 
sub-maps. Each node shows the number of components that have been mapped onto 
that node; bracketed numbers give the sub-map number. The first sub-map, 
GHSOM_CM2, at Level 2, is populated by square-shaped components (four shown on 
each node for clarity). GHSOM_CM33, also on Level 2, is less uniform but still 
shows a large population of "C"-shaped components. This topological ordering is 
consistent with previous reports [5, 12] that the 15 shape measures used (see section 
4.1) are good descriptors. Similar topological results were found for the large fixed-size 
SOM_CM maps, but because of the large number of nodes (e.g. 175x150 = 26250 
nodes) these maps would be too compressed to show clearly (see [7] for the smaller 
SOM_CM map, from a 268 image set). 



Fig. 1. Three GHSOM_CM maps are shown: the top-level map, sub-map GHSOM_CM2 with 
square-shaped components, and sub-map GHSOM_CM33 with "C"-shaped components. Four components per node 
are shown on terminal nodes, together with the number of components on each node 

Fig. 2. SOM_CSV12x15 maps where the CSV vectors were created from (a) a SOM_CM175x150 and (b) a GHSOM_CM 
(query 1138103 has been circled) 



6 Results and Analysis from Retrieval Experiments 

6.1 Qualitative Analysis of Visual Maps 

The result of creating a set of CSV vectors and training two SOM_CSV12x15 maps for 
visualising retrieved trademarks is shown in Fig. 2 for the query 1138103, which has 
been circled and is shown surrounded by similar trademark images. The CSV 
vectors of Fig. 2(a) were created by searching a fixed-size SOM_CM175x150 with a neighbourhood 
radius of r = 6. Fig. 2(b) was created by searching the trained GHSOM_CM of Fig. 1, 
again at radius r = 6. These maps help to make clusters stand out, such as the query 
with expected result set members in Fig. 2(a) and the set of three in the lower left-hand 
corner or the left-of-centre group. Qualitatively, clustering appears better using 
the SOM_CM of Fig. 2(a) than using the GHSOM_CM of Fig. 2(b). Observations of other 
queries (from the set of 24) confirm that clusters are visually explicit on these 2-D 
SOM_CSV maps for this large collection of trademark images, for both the large fixed 
SOM_CM and the dynamic GHSOM_CM. 

6.2 Quantitative Results and Analysis 

The average precision-recall P(r)-R(r) graphs for all 24 query sets on the 
SOM_CSV12x15, where the CSV vectors were created by searching various CMs within a 
neighbourhood radius of r = 2 (from the query component), are shown in Fig. 3, using the maps of 
Tables 1 and 2. Fig. 3(a) shows the fixed-size SOM_CM (SOM_CM_r2_<size>), while Fig. 
3(b) shows the GHSOM_CM with a SOM_CM of size <sizeL1> at Level 1 and <L2> and <L3> 
SOM_CMs at Levels 2 and 3 respectively (GHSOM_r2_<sizeL1>_<L2>_<L3>). 

Table 1 shows six fixed size SOM_CM maps and their overall average precision-recall 
values on the SOM_CSV12x15 map, using CSV vectors from the SOM_CM at neighbourhood radii of 2, 4 and 6 
around the query components. Note that the training times involved 
were very long (1-3 hours). 
Fig. 3. Average precision-recall graph of SOM_CSV12x15 for (a) fixed-size SOM_CM and (b) 
GHSOM_CM Component Maps within a search neighbourhood of 2, from Tables 1 and 2 



Table 1. CSV SOM map's overall average precision-recall ($P_{avg}$, $R_{avg}$) at different radii r on the SOM 
Component Map, and training times, for all 24 queries 

SOM_CM size   Training time (min)   r = 2             r = 4             r = 6 
                                    R_avg    P_avg    R_avg    P_avg    R_avg    P_avg 
150x120       60.3                  0.459    0.184    0.536    0.115    0.572    0.090 
175x150       83.6                  0.418    0.190    0.496    0.121    0.541    0.087 
250x200       198.4                 0.392    0.221    0.459    0.159    0.507    0.116 



Table 2 shows how, by varying $\tau_1$ and $\tau_2$, the GHSOM_CM can grow the SOM_CM on 
Layer 1 (always one SOM) and vary the number of SOM_CMs added in Layers 2 and 3. 
The overall average precision-recall of the SOM_CSV12x15, from the CSV vectors of the 
GHSOM_CM maps at radius 2, is given too. Note that the training times, often less than 
a minute, were much shorter than those for the single-layer fixed SOM_CM of Table 1. 
The first three GHSOM_CM maps, with a large $\tau_2$, only grew the Layer 1 SOM_CM, and as 
$\tau_1$ was increased its size got smaller and so did the training times. As $\tau_2$ gets smaller 
(for fixed $\tau_1$), SOM_CM maps add SOM maps to the next level. For example, where 
$\tau_1$ = 0.05 and $\tau_2$ shrinks to 0.00004, SOM_CMs from Level 2 grew SOM_CMs at Level 3. 
SOMs at Level 2 were relatively quick to train, as most were small and only the subset 
of the original training data set mapped onto their parent node was used to train them. 
The same holds for Level 3 maps, which had even smaller training sets. 
Table 2. GHSOM Component Map structure for different parameter settings of $\tau_1$ and $\tau_2$: size 
of the Layer 1 map, number of maps in Layers 2 and 3, training time, and the CSV SOM map's overall 
average precision-recall for radius r = 2 on the GHSOM_CM, for all 24 queries 

GHSOM Component Map parameters $\tau_1$ and $\tau_2$ 

$\tau_1$                       0.0001   0.001   0.005   0.005   0.005    0.005     0.001 
$\tau_2$                       0.04     0.04    0.04    0.004   0.0004   0.00004   0.0004 
Layer 1 (always 1 SOM) size    22x26    8x12    4x8     4x8     4x8      4x8       8x12 
Layer 2 no. maps               -        -       -       21      32       32        86 
Layer 3 no. maps               -        -       -       -       -        140       - 
Training time                  7m 2s    52s     21s     43s     48s      57s       2m 53s 
SOM_CSV R_avg                  0.571    0.566   0.510   0.578   0.540    0.538     0.470 
SOM_CSV P_avg                  0.059    0.038   0.034   0.076   0.097    0.110     0.216 



Precision-recall profiles from Fig. 3 appear to agree with the qualitative results from 
Fig. 2: large fixed-size SOM_CM maps have better profiles than the small flat GHSOM_CM maps, and 
are therefore better at query retrieval and clustering. However, by expanding the poorly 
profiled one-layer GHSOM_r2_4x8_0_0 (third GHSOM_CM of Table 2) to a three-layer 
GHSOM_r2_4x8_32_140 (sixth GHSOM_CM of Table 2), profiles comparable to 
the fixed SOM_CM are achieved in a shorter training period. One reason for 
this is that the GHSOM_CM can distribute components from crowded nodes onto 
specialised sub-maps as in Fig. 1, allowing precision to increase. 

From Table 2, the third GHSOM_CM, with only one 4x8 = 32 node SOM_CM at Level 1, 
expanded all of its nodes to Level 2 in the fifth GHSOM_CM, so all image components 
were found at Level 2. However, the related fourth GHSOM_CM expanded only 21 of its 
nodes, so component searching was conducted on two different levels. 
The results suggest that creating multiple maps improves precision, but the recall rates do 
not vary significantly. 



7 Conclusions and Further Work 

In this paper we investigated the ability of our component-based retrieval framework 
to handle a range of real queries with a collection of over 10000 trademark images. 
With large fixed SOM_CM maps, the retrieved SOM_CSV trademark cluster maps were visually 
successful, but because of the larger number of images, and therefore components, 
the average precision-recall profiles were not as good as in our previous study 
[7]. The multi-layered GHSOM_CM also produced relatively good profiles in a shorter 
time (as can a large single-layered GHSOM_CM, being similar to a similar-sized fixed 
SOM_CM). However, they are not as good, because relevant components were 
distributed across multiple sub-maps and can be missed when searching only one 
sub-map, as with the two sub-maps of Fig. 1, which hold similarly shaped square and 
circle components. 

With the GHSOM_CM it was found that nodes often had many components mapped 
onto them, so the precision was low. This is being investigated by limiting the number 
of components that can share a node, which should raise precision. In cases where 
the neighbourhood radius extends beyond the edge of a sub-map, extending the 
search to cover adjacent sub-maps should increase recall, though at the 
expense of a possible loss of precision, as more image components are found. 

We have also started to investigate whether this component-based framework can 
be applied to a different topological network, the Generative Topographic Mapping [15], 
a principled alternative to the SOM. 



References 

1. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image 
retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine 
Intelligence, 22(12) (2000), 1349-1380 

2. Eakins, J.P.: Trademark image retrieval. In M. Lew (Ed.), Principles of Visual Information 
Retrieval (Ch 13). Springer-Verlag, Berlin (2001) 

3. Kato, T.: Database architecture for content-based image retrieval. In: Image Storage and 
Retrieval Systems, Proc SPIE 2185 (1992), 112-123 

4. Eakins, J.P., Boardman, J.M., Graham, M.E.: Similarity Retrieval of Trademark Images. 
IEEE Multimedia, 5(2) (1998), 53-63 

5. Eakins, J.P., Riley, K.J., Edwards, J.D.: Shape feature matching for trademark image 
retrieval. Presented at The Challenge of Image and Video Retrieval (CIVR 2003), Lecture 
Notes in Computer Science, 2728 (2003), 28-38 

6. Hussain, M., Eakins, J.P., Sexton, G.: Visual Clustering of Trademarks Using the Self-Organizing 
Map. Presented at The Challenge of Image and Video Retrieval (CIVR 2002), 
Lecture Notes in Computer Science, 2383 (2002), 147-156 

7. Hussain, M., Eakins, J.P.: Component Based Visual Clustering using the Self-Organizing 
Map. Neural Networks, submitted for publication 

8. Kohonen, T.: Self-Organizing Maps, 3rd Ed. Springer-Verlag, Berlin (2001) 

9. Fritzke, B.: Growing cell structures - a self-organizing network for unsupervised and 
supervised learning. Neural Networks, 7(9) (1994), 1441-1460 

10. Hodge, V.J., Austin, J.: Hierarchical Growing Cell Structures: TreeGCS. IEEE Knowledge 
and Data Engineering, 13(2) March/April (2001) 

11. Dittenbach, M., Rauber, A., Merkl, D.: Uncovering hierarchical structure in data using the 
growing hierarchical self-organizing map. Neurocomputing, 48 (2002), 199-216 

12. Eakins, J.P., Edwards, J.D., Riley, J., Rosin, P.L.: A comparison of the effectiveness of 
alternative feature sets in shape retrieval of multi-component images. Storage and 
Retrieval for Media Databases 2001, Proc SPIE 4315 (2001), 196-207 

13. Rosin, P.L.: Measuring shape: ellipticity, rectangularity and triangularity. Proc of 15th 
International Conference on Pattern Recognition, Barcelona 1 (2000), 952-955 

14. Koskela, M., Laaksonen, J., Laakso, S., Oja, E.: The PicSOM Retrieval System: 
Description and Evaluation. Proceedings of The Challenge of Image Retrieval, Third UK 
Conference on Image Retrieval, Brighton UK (2000) 

15. Bishop, C.M., Svensen, M., Williams, C.K.I.: GTM: the generative topographic mapping. 
Neural Computation, 10(1) (1998), 215-234 




Assessing Scene Structuring in Consumer Videos 



Daniel Gatica-Perez 1 , Napat Triroj 2 , Jean-Marc Odobez 1 , 
Alexander Loui 3 , and Ming-Ting Sun 2 

1 IDIAP, Martigny, Switzerland 
2 University of Washington, Seattle WA, USA 
3 Eastman Kodak Company, Rochester NY, USA 



Abstract. Scene structuring is a video analysis task for which no com- 
mon evaluation procedures have been fully adopted. In this paper, we 
present a methodology to evaluate such a task in home videos, which takes 
into account human judgement, and includes a representative corpus, a 
set of objective performance measures, and an evaluation protocol. The 
components of our approach are detailed as follows. First, we describe 
the generation of a set of home video scene structures produced by mul- 
tiple people. Second, we define similarity measures that model variations 
with respect to two factors: human perceptual organization and level 
of structure granularity. Third, we describe a protocol for evaluation 
of automatic algorithms based on their comparison to human perfor- 
mance. We illustrate our methodology by assessing the performance of 
two recently proposed methods: probabilistic hierarchical clustering and 
spectral clustering. 



1 Introduction 

Many video browsing and retrieval systems make use of scene structuring, to 
provide non-linear access beyond the shot level, and to define boundaries for 
feature extraction for higher-level tasks. Scene structuring is a core function in 
video analysis, but the comparative performance of existing algorithms remains 
unknown, and common evaluation procedures have just begun to be adopted. 

Scene structuring should be evaluated based on the nature of the content, 
(e.g. videos with “standard” scenes like news programs [3], or created with a 
storyline like movies [10]). In particular, home videos depict unrestricted content 
with no storyline, and contain temporally ordered scenes, each composed of 
a few related shots. Despite their non-professional style, home video scenes are 
the result of implicit rules of attention and recording [6,4]. Home filmmakers 
keep their interest on their subjects for a finite duration, influencing the time 
they spend recording individual shots, and the number of shots captured per 
scene. Recording also imposes temporal continuity: filming a trip with a non- 
linear temporal structure is rare [4]. Scene structuring can then be studied as 
a clustering problem, and is thus related to image clustering and segmentation 
[9,5]. 

The evaluation of a structuring algorithm assumes the existence of a ground- 
truth (GT) at the scene level. At least two options are conceivable. In the first- 
party approach, the GT is generated by the content creator, thus incorporating 
specific context knowledge (e.g. place relationships) that cannot be automatically 
extracted by current means. In contrast, a third-party GT is defined by a subject 
not familiar with the content [4]. In this case, there still exists human context 
understanding, but limited to what is displayed. Multiple cues ranging from 
color coherence, scene composition, and temporal proximity, to high-level cues 
(recognition of objects/places) allow people to identify scenes in home video 
collections. 

One criticism against third-party GTs is the claim that, as different people 
generate distinct GTs, no single judgement is reliable. A deeper question that 
emerges is that of consistency of human structuring of videos, which in turn 
refers to the general problem of perceptual organization of visual information¹.
One could expect that variations in human judgement arise both from distinct
perceptions of a video scene structure, and from different levels of granularity in it
[5]. Modeling these variations with an appropriate definition of agreement would
be useful to compare human performance, and to define procedures to evaluate 
automatic algorithms. Similar goals have been pursued for image segmentation 
[5] and clustering [9], but to our knowledge work on videos has been limited. 

We present a methodology to evaluate scene structuring algorithms in consumer
videos. We first describe the creation of a corpus of 400 human-generated
video scene structures extracted from a six-hour video database (Section 2). We 
then present a set of similarity measures that quantify variations in human per- 
ceptual organization and scene granularity (Section 3). The measures can be 
used to assess human performance on the task (Section 4), but they are also 
useful to evaluate automatic algorithms, for which we introduce an evaluation 
protocol (Section 5). Finally, the protocol is applied to compare the performance 
of two recent methods (Section 6). Section 7 provides some concluding remarks. 



2 Video Scene Structure Corpus 

2.1 Home Video Database 

The data set includes 20 MPEG-1 videos, each lasting between 18 and 24
minutes [4]. While relatively small (six hours), the set is representative of the genre,
depicting both indoor (e.g. family gatherings and weddings) and outdoor (e.g.
vacations) scenes. A manual GT at the shot level resulted in 430 shots. The
number of shots per video varies substantially across the set (4-62 shots; see
Fig. 2(a)).



¹ Perceptual organization is “a collective term for a diverse set of processes that
contribute to the emergence of order in the visual input” [2], and “the ability to impose
structural organization on sensory data, so as to group sensory primitives arising 
from a common underlying cause” [1]. In computer vision, perceptual organization 
research has addressed image segmentation, feature grouping, and spatio-temporal 
segmentation, among other problems, using theories from psychology (e.g. Gestalt). 

Fig. 1. Scene structuring tool (detail). Each column of thumbnails represents a shot.

2.2 Tools for Scene Structuring 

We define a video structure as composed of four levels (video clip, scene, shot, and 
subshot) [4]. Home video shots usually contain more than one appearance, due 
to hand-held camera motion, so subshots are defined to be intra-shot segments 
with approximately homogeneous appearance. A shot can then be represented 
by a set of key-frames (thumbnails) extracted from each of its subshots. 

The amount of time required for human scene structuring is prohibitive when
subjects deal with the raw videos. Providing a GUI with video playback and
summarized information notably reduces the effort, but it remains considerable
for long videos because of the time spent playing them. For this reason, we
developed a GUI in which users were not shown any raw videos, but only their
summarized information (Fig. 1). Subshots and key-frames were automatically
extracted by standard methods [4], and thumbnails were arranged on the screen
in columns to represent shots. As pointed out in [8], images organized by visual
similarity can facilitate location of images that satisfy basic requirements or
solve simple tasks. In our case, the natural temporal ordering in video
represents a strong cue for perceptual organization. In the GUI, a scene is
represented by a list of shot numbers introduced by the user via the keyboard,
so the scene structure is a partition of the set of all shots in a video, created
by the user from scratch. The time needed to find the scenes in a video depends
on its number of shots, and is a couple of minutes on average.



2.3 The Task 

A very general statement was purposely provided to the subjects at the beginning
of the structuring process: “group neighboring shots together if you believe
they belong to the same scene. Any scene structure containing between one and
as many scenes as the number of shots is reasonable”. Users were free to define
in their own terms both the concept of scene and the appropriate number of
scenes in a video, as there was not a single correct answer. Following [5], such
a broad task was provided in order to force the participants to find “natural”
video scenes.

2.4 Experimental Setup

Participants. A set of 20 university-level students participated in the experi- 
ments. Only two of the subjects had some knowledge in computer vision. 
Apparatus. For each video in the database, we created the video structures as 
described in Section 2.2, using thumbnails of size 88x60 pixels. We used PCs 
with standard monitors (17-inch, 1024x768 resolution), running Windows NT4. 
Procedure. All participants were informed about the purpose of the experiment 
and the GUI use, and were shown an example to practice. As mentioned earlier, 
no initial solution was proposed. Each person was asked to find the scenes in all 
20 videos, and was asked to take a break if necessary. Additionally, in an attempt
to refresh the subjects’ attention on the task, the video set was arranged so that
the levels of video complexity (as defined by the number of shots) alternated.
A total of 400 human-generated scene structures were produced in this way.

3 Measuring Agreement 

If a unique, correct scene structure does not exist, how can we then assess agree- 
ment between people? Alternatives to measure agreement in image sets [9] and 
video scenes [4] have been proposed. By analogy with natural image segmen- 
tation [5], here we hypothesize that variations in human judgement of scene 
structuring can be thought of as arising from two factors: (i) distinct percep- 
tual organization of a scene structure, where people perceive different scenes 
altogether, so shots are grouped in completely different ways, and (ii) distinct 
granularity in a scene structure, which generates structures whose scenes are 
simply refinements of each other. We discuss both criteria to assess consistency 
in the following subsections. 



3.1 Variations in Perceptual Organization 

Differences in perceptual organization of a scene structure, that is, cases in which 
people observe completely different scenes, are a clear source of inconsistency. 
A definition of agreement that does not penalize granularity differences was 
proposed in [5] for image segmentation, and can be directly applied to video 
partitions. Let $S_i$ denote a scene structure of a video (i.e., a partition of the set
of shots, each assigned to one scene). For two scene structures $S_i$, $S_j$ of a $K$-shot
video, the local refinement error (LRE) for shot $s_k$, with range $[0, 1)$, is defined
by

$$\mathrm{LRE}(S_i, S_j, s_k) = \frac{\| R(S_i, s_k) \setminus R(S_j, s_k) \|}{\| R(S_i, s_k) \|}, \tag{1}$$

where $\setminus$ and $\|\cdot\|$ denote set difference and cardinality, respectively, and $R(S_i, s_k)$
is the scene in structure $S_i$ that contains $s_k$. On one side, given shot $s_k$, if
$R(S_i, s_k)$ is a proper subset of $R(S_j, s_k)$, LRE = 0, which indicates that the first
scene is a refinement of the second one. On the other side, if there is no overlap
between the two scenes other than $s_k$, LRE $= (\|R(S_i, s_k)\| - 1)/\|R(S_i, s_k)\|$,
indicating an inconsistency in the perception of scenes.
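
As a small worked illustration of the local refinement error, consider a hypothetical five-shot video with two scene structures. The video and both partitions are invented here only to exercise the definition; they are not taken from the corpus.

```latex
% Hypothetical 5-shot video (K = 5) with two scene structures
S_i = \{\{s_1, s_2\}, \{s_3, s_4, s_5\}\}, \qquad
S_j = \{\{s_1, s_2, s_3\}, \{s_4, s_5\}\}

% Shot s_1: the scene in S_i is contained in the scene in S_j (refinement)
\mathrm{LRE}(S_i, S_j, s_1)
  = \frac{\|\{s_1, s_2\} \setminus \{s_1, s_2, s_3\}\|}{\|\{s_1, s_2\}\|}
  = \frac{0}{2} = 0

% Shot s_3: the two scenes overlap only in s_3 itself
\mathrm{LRE}(S_i, S_j, s_3)
  = \frac{\|\{s_3, s_4, s_5\} \setminus \{s_1, s_2, s_3\}\|}{\|\{s_3, s_4, s_5\}\|}
  = \frac{2}{3}
```

The second value agrees with the no-overlap case described in the text, since (3 - 1)/3 = 2/3.
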

To obtain a global measure, LRE has to be made symmetric, as
$\mathrm{LRE}(S_i, S_j, s_k)$ and $\mathrm{LRE}(S_j, S_i, s_k)$ are not equal in general, and computed
over the entire video. Two overall measures proposed in [5] are the global and
local consistency errors,

$$\mathrm{GCE}(S_i, S_j) = \min\Big\{ \sum_k \mathrm{LRE}(S_i, S_j, s_k), \; \sum_k \mathrm{LRE}(S_j, S_i, s_k) \Big\}, \tag{2}$$

$$\mathrm{LCE}(S_i, S_j) = \sum_k \min\big\{ \mathrm{LRE}(S_i, S_j, s_k), \; \mathrm{LRE}(S_j, S_i, s_k) \big\}. \tag{3}$$

To compute GCE, the LREs are accumulated for each direction (i.e. from
$S_i$ to $S_j$ and vice versa), and then the minimum is taken. Each direction defines
a criterion for which scene structure refinement is not penalized. On the other
hand, LCE accumulates the minimum error in either direction, so structure
refinement is tolerated in any direction for each shot. It is easy to see that
GCE ≥ LCE, so GCE constitutes a stricter performance measure than LCE [5].
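
The three measures can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: scene structures are represented as lists of sets of shot indices, all function names are hypothetical, and the optional `normalize` flag (division by the number of shots, which keeps the values in [0, 1)) is an assumption added here, since the text only describes accumulated sums.

```python
from fractions import Fraction


def scene_of(structure, shot):
    """Return the scene of `structure` (a list of sets) that contains `shot`."""
    for scene in structure:
        if shot in scene:
            return frozenset(scene)
    raise ValueError(f"shot {shot} is not in the structure")


def lre(s_i, s_j, shot):
    """Local refinement error for one shot: |R_i minus R_j| / |R_i|."""
    r_i = scene_of(s_i, shot)
    r_j = scene_of(s_j, shot)
    return Fraction(len(r_i - r_j), len(r_i))


def gce(s_i, s_j, shots, normalize=False):
    """Global consistency error: minimum over the two accumulated directions."""
    forward = sum(lre(s_i, s_j, k) for k in shots)
    backward = sum(lre(s_j, s_i, k) for k in shots)
    total = min(forward, backward)
    # Assumption: optional division by the number of shots.
    return total / len(shots) if normalize else total


def lce(s_i, s_j, shots, normalize=False):
    """Local consistency error: per-shot minimum, then accumulated."""
    total = sum(min(lre(s_i, s_j, k), lre(s_j, s_i, k)) for k in shots)
    return total / len(shots) if normalize else total


# Hypothetical 5-shot video with two human scene structures.
shots = [1, 2, 3, 4, 5]
s_i = [{1, 2}, {3, 4, 5}]
s_j = [{1, 2, 3}, {4, 5}]
one_scene = [{1, 2, 3, 4, 5}]

print(gce(s_i, s_j, shots), lce(s_i, s_j, shots))
print(gce(s_i, one_scene, shots))
```

Comparing `s_i` with `one_scene` gives a zero error because every scene of `s_i` is a refinement of the single scene, which is the behaviour the text describes.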

3.2 Variations in Structure Granularity 

The above measures do not account for any differences of granularity, and are 
reasonably good when the number of detected scenes in two video scene struc- 
tures is similar. However, two different scene structures (e.g. one in which each 
shot is a scene, and one in which all shots belong to the same scene) produce 
a zero value for both GCE and LCE when compared to any arbitrary scene 
structure. In other words, the concept of “perfect agreement” as defined by 
these measures conveys no information about differences of judgment w.r.t. the 
number of scenes. In view of this limitation, we introduce a revised measure that
takes into account variations in the number of detected scenes, by defining a
weighted sum,

$$\mathrm{GCE}'(S_i, S_j) = \alpha_1 \, \mathrm{GCE}(S_i, S_j) + \alpha_2 \, C(S_i, S_j), \tag{4}$$

where $\sum_i \alpha_i = 1$, and the correction factor $C(S_i, S_j) = |N(S_i) - N(S_j)| / N_{\max}$, where
$N(S_i)$ is the number of scenes detected in $S_i$, and $N_{\max}$ is the maximum number
of scenes allowed in a video ($K$). A similar expression can be derived for LCE'.
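
The weighted measure can be sketched as follows. The sketch assumes that the correction factor is the absolute difference in the number of scenes divided by N_max; this is an interpretation of the definitions of N(S_i) and N_max given above, and the function names are hypothetical. The weights 0.85 and 0.15 are the ones reported in Section 4.

```python
def correction_factor(n_i, n_j, n_max):
    """Assumed form of C: |N(S_i) - N(S_j)| / N_max."""
    if n_max <= 0:
        raise ValueError("n_max must be positive")
    return abs(n_i - n_j) / n_max


def weighted_error(consistency_error, correction, alpha1, alpha2):
    """Weighted sum of a consistency error (GCE or LCE) and the correction factor."""
    if abs(alpha1 + alpha2 - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return alpha1 * consistency_error + alpha2 * correction


# Degenerate case from the text, for a hypothetical 10-shot video:
# one structure has every shot as its own scene, the other has a single scene.
# GCE is zero for this pair, but the number of scenes differs by nine.
c = correction_factor(10, 1, 10)
gce_prime = weighted_error(0.0, c, alpha1=0.85, alpha2=0.15)
print(c, gce_prime)
```

With the correction term, the pair no longer receives a zero error even though neither GCE nor LCE penalizes it.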

4 Human Scene Structuring 

The discussed measures were computed for all pairs of human-generated scene
structures for each video in the data set. Note that, as shots are the basic units,
partitions corresponding to different videos are not directly comparable. Fig. 3(a)
shows the distributions of GCE and LCE between pairs of human-generated
scene structures of the same video. All distributions show a peak near zero, and
the error remains low, with means shown in Table 1. It is also clear that GCE is
a stricter measure than LCE. Under measures that only penalize differences
in perceptual organization, people produced consistent results on most videos
on the task of partitioning them into an arbitrary number of scenes.

However, the variation in performance with respect to the number of detected
scenes (not directly measured by GCE/LCE) is considerable. Fig. 2(b) displays
the mean and standard deviation of the number of detected scenes for each video.
The videos are displayed in increasing order, according to the number of shots
they contain (Fig. 2(a)). As a general trend, videos with more shots produce
larger variation in the number of detected scenes. Referring to Fig. 3(a), the
strong peaks near zero are somewhat misleading, as it is obvious that human
subjects did not produce identical scene structures. The distributions of the new
performance measures (GCE' and LCE') for weights $\alpha_1 = 0.85$, $\alpha_2 = 0.15$
are shown in Fig. 3(b). The weights were chosen so that the weighted means
of GCE and C each account for approximately half of the mean of GCE'. For the
new measures, the distributions no longer present peaks at zero. The errors are
higher, as they explicitly penalize differences in judgement regarding the number
of scenes.
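
Because the weighted measure is a linear combination, its mean is the same combination of the means of its two terms. The short check below recombines the means reported in Table 1 with the weights 0.85 and 0.15. It is only a sanity check under the assumption that the same weights were used for every row and for both GCE' and LCE'; the variable names are hypothetical.

```python
ALPHA_1, ALPHA_2 = 0.85, 0.15

# Means copied from Table 1: (GCE, LCE, C, GCE', LCE').
TABLE_1 = {
    "human/human": (0.0321, 0.0119, 0.2416, 0.0635, 0.0463),
    "PHC/human": (0.0656, 0.0216, 0.2725, 0.0966, 0.0592),
    "SC/human": (0.0740, 0.0377, 0.2214, 0.0962, 0.0653),
}


def recombine(error_mean, c_mean):
    """Mean of the weighted measure implied by the means of its two terms."""
    return ALPHA_1 * error_mean + ALPHA_2 * c_mean


for case, (gce, lce, c, gce_p, lce_p) in TABLE_1.items():
    print(
        f"{case:12s} GCE' {recombine(gce, c):.4f} (reported {gce_p:.4f})  "
        f"LCE' {recombine(lce, c):.4f} (reported {lce_p:.4f})"
    )

# Share of the human/human GCE' mean contributed by the weighted GCE term.
gce_share = ALPHA_1 * 0.0321 / recombine(0.0321, 0.2416)
print(f"GCE share of GCE' (human/human): {gce_share:.2f}")
```

For the human/human row the weighted GCE term contributes roughly 43% of the GCE' mean, which is in line with the statement that the two terms each account for approximately half.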

Overall, given the small dataset we used, the results seem to suggest that 
(i) there is human agreement in terms of perceptual organization of videos into 
scenes, (ii) people present a large variation in terms of scene granularity, and (iii) 
the degree of agreement in scene granularity depends on the video complexity. 




Fig. 2. (a) Number of shots per video in the database (in increasing order); (b) mean 
and standard deviation of number of scenes detected by people for each video. 



5 Evaluation Protocol 

To evaluate an automatic method by comparing it to human performance, two 
issues have to be considered. First, the original measures ( GCE/LCE ) are useful 
for comparison when the number of scenes in two scene structures is similar. 
This is convenient when the number of scenes is a free parameter that can be 
manually set, as advocated by [5]. However, such a procedure would not measure
the ability of the algorithm to perform model selection. For this case, we think
that the proposed measures (GCE'/LCE') are more appropriate. Second, the
performance of both people and automatic algorithms might depend on the
individual video complexity.

With this in mind, we propose to evaluate performance by the following protocol [7].
For each video, let $S_A$ denote the scene structure obtained by an automatic
algorithm, and $S_j$ the $j$-th human-generated scene structure. We can then compute
$\mathrm{GCE}'(S_A, S_j)$ for all people, rank the results, and keep three measures: minimum,
median, and maximum, denoted by $\mathrm{GCE}'_{\min}(S_A, S_j)$, $\mathrm{GCE}'_{med}(S_A, S_j)$,
and $\mathrm{GCE}'_{\max}(S_A, S_j)$, respectively. The minimum provides an indication of how
close an automatic result is to the nearest human result. The median is a fair 
performance measure, which considers all the human responses while not being 
affected by the largest errors. Such large errors are considered by the maximum. 
An overall measure is computed by averaging the GCE' measures over all the 
videos. To compute the same measures among people, for each video, the three 
measures are computed for each subject against all others, and these values are 
averaged over all subjects. The overall performance is computed by averaging 
over all videos. To visualize performance, it is useful to plot the distributions 
of GCE' and LCE', obtained by comparing automatic and human-generated
scene structures, as in Fig. 3. Finally, to compare two algorithms, the described 
protocol can be applied to each algorithm, followed by a test for statistical sig- 
nificance. 
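
The ranking and averaging steps of the protocol can be sketched as below. The error function is passed in as a parameter; the toy error used in the example (a scaled difference in the number of scenes) is only a stand-in for GCE', and all names and numbers are hypothetical.

```python
from statistics import mean, median

KEYS = ("min", "med", "max")


def rank_measures(errors):
    """Keep the minimum, median and maximum of a list of errors."""
    ordered = sorted(errors)
    return {"min": ordered[0], "med": median(ordered), "max": ordered[-1]}


def algorithm_vs_humans(auto_structure, human_structures, error_fn):
    """Compare one automatic structure against every human structure of a video."""
    return rank_measures([error_fn(auto_structure, h) for h in human_structures])


def humans_vs_humans(human_structures, error_fn):
    """Compare each subject against all others, then average over subjects."""
    per_subject = []
    for idx, subject in enumerate(human_structures):
        others = human_structures[:idx] + human_structures[idx + 1:]
        per_subject.append(rank_measures([error_fn(subject, o) for o in others]))
    return {key: mean(m[key] for m in per_subject) for key in KEYS}


def overall(per_video_measures):
    """Average the three measures over all videos."""
    return {key: mean(m[key] for m in per_video_measures) for key in KEYS}


# Toy example: each "structure" is reduced to its number of scenes and the
# error is a scaled difference (a placeholder, not the measure in the paper).
def toy_error(a, b):
    return abs(a - b) / 10


humans = [2, 3, 5]
auto_result = algorithm_vs_humans(4, humans, toy_error)
human_result = humans_vs_humans(humans, toy_error)
print(auto_result)
print(human_result)
```

The per-video dictionaries returned for each algorithm can then be passed to `overall`, and the per-video values of two algorithms can be fed to a paired significance test.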



6 Assessing Automatic Algorithms 

6.1 The Algorithms 

We illustrate our methodology on two recently proposed algorithms based on 
pair-wise similarity. For space reasons, we describe the algorithms only briefly here.

The first algorithm is probabilistic hierarchical clustering (PHC) [4]. It con- 
sists of a sequential binary Bayes classifier, which at each step evaluates a pair 
of video segments and decides on merging them into the same scene according 
to Gaussian mixture models of intra- and inter-scene visual similarity, scene du- 
ration, and temporal adjacency. The merging order and the merging criterion 
are based on the evaluation of a posterior odds ratio. The algorithm implicitly 
performs model selection. Standard visual features for each shot are extracted 
from key-frames (color histograms). Additionally, temporal features exploit the 
fact that distant shots along the temporal axis are less likely to belong to the 
same scene. 

The second method uses spectral clustering (SC) [7], which has been shown 
to be effective in a variety of segmentation tasks. The algorithm first constructs 
a pair-wise key-frame similarity matrix, for which similarity is defined in both 
visual and temporal terms. After matrix pre-processing, its spectrum (eigenvectors)
is computed. Then, the $\mathcal{K}$ largest eigenvectors are stacked in columns in
a new matrix, and the rows of this new matrix are normalized. Each row of
this matrix constitutes a feature vector associated with a key-frame in the video. The
rows of this matrix are then clustered using K-means (with $\mathcal{K}$ clusters), and
all key-frames are labeled accordingly. Shots are finally clustered based on their
key-frame labels by using a majority vote rule. Model selection is performed 
automatically using the eigengap, a measure often used in matrix perturbation 
and spectral graph theories. The algorithm uses the same visual and temporal 
features as PHC, adapted to the specific formulation. 
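
The last step of the SC method, assigning each shot the label that most of its key-frames received, can be sketched as follows. This covers only the majority vote, not the spectral embedding or the clustering; the function name is hypothetical, and breaking ties in favour of the smallest label is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter


def shot_labels_by_majority(keyframe_labels, keyframe_to_shot):
    """Assign each shot the most frequent cluster label among its key-frames."""
    votes = {}
    for label, shot in zip(keyframe_labels, keyframe_to_shot):
        votes.setdefault(shot, Counter())[label] += 1
    # Assumption: ties are broken by choosing the smallest label.
    return {
        shot: min(counts.items(), key=lambda item: (-item[1], item[0]))[0]
        for shot, counts in votes.items()
    }


# Hypothetical video: six key-frames belonging to three shots.
keyframe_labels = [0, 0, 1, 1, 1, 2]   # cluster label of each key-frame
keyframe_to_shot = [0, 0, 0, 1, 1, 2]  # shot index of each key-frame
shot_labels = shot_labels_by_majority(keyframe_labels, keyframe_to_shot)
print(shot_labels)
```
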

6.2 Results and Discussion

Figs. 3(c-f) show the error distributions when comparing the scenes found by 
people and the two automatic algorithms. The means for all measures are shown 
in Table 1. Comparing the results to those in Figs. 3(a-b), the errors for the 
automatic algorithms are higher than the errors among people. The degradation 
is more noticeable for the GCE and LCE measures, with a relative increase 
of more than 100% in the mean error for all cases. These results suggest that 
the automatic methods do not extract the scene structure as consistently as 
people do. In contrast, the relative variations in the correction factor are not so 
large. Overall, the automatic methods increase the error for GCE' and LCE':
53.3% and 52.7% for GCE', and 27.8% and 41.0% for LCE', for PHC and SC,
respectively. 




Fig. 3. (a-b) Human scene structuring: (a) distributions of GCE (top) and LCE (bottom)
for all pairs of video scene structures (same videos) in the database; (b) distributions
of GCE' and LCE'. (c-d) PHC vs. human: (c) GCE (top) and LCE (bottom);
(d) GCE' and LCE'. (e-f) SC vs. human: (e) GCE and LCE; (f) GCE' and LCE'.

The results of our protocol appear in Table 2. Again, the error of the automatic
algorithms vs. people is higher than the errors among people, and the performance
of PHC and SC is quite similar. We used a two-tailed Wilcoxon signed-rank test
to detect significant differences between the two automatic algorithms for the
min, med, and max performance over all videos. The obtained p-values are 0.658,
0.970, and 0.881, respectively, so the difference in performance is not statistically
significant. In contrast, the tests comparing human vs. SC produced p-values of
0.147, 0.030, and 0.004, respectively, which indicates that the difference in
performance for the min is only significant at the p<0.15 level, but the differences
for med and max are significant at the p<0.05 and p<0.005 levels, respectively.
Similar results are obtained when comparing human vs. PHC with the Wilcoxon
test. Examples of human- and computer-generated video scene structures can be
seen at www.idiap.ch/~gatica/homevideoassess.html. Note that, although PHC and
SC do not perform significantly differently under this similarity measure, previous
work using a different measure had favored SC [7].

Table 1. Error means. Human vs. human and automatic vs. human.

Case          GCE     LCE     C       GCE'    LCE'
human/human   0.0321  0.0119  0.2416  0.0635  0.0463
PHC/human     0.0656  0.0216  0.2725  0.0966  0.0592
SC/human      0.0740  0.0377  0.2214  0.0962  0.0653

Fig. 4 displays the results for each video for the two automatic algorithms.
Figs. 4(a) and 4(c) show the number of detected scenes (red circles), and compare
them to the mean number of scenes in the GT (blue crosses). The blue bar
denotes the standard deviation (std) in the GT. For both algorithms, the detected
number of scenes matches the GT well, although it is somewhat underestimated.
The number of scenes estimated by PHC (resp. SC) remains within one std of the
mean human performance in 15 (resp. 17) of the 20 videos; in addition, in 14
(resp. 18) cases, the automatic method detected exactly the same number of
scenes as at least one person did in the GT. These numbers are in agreement with
the column for C in Table 1.

Fig. 4. Automatic (circles) vs. human (crosses) scene structuring. Top row: PHC. Bottom
row: SC. (a-c) Number of detected scenes, (b-d) GCE' error. The bar is the spread
of human performance (see text for details).

Figs. 4(b) and 4(d) show GCE' compared to the average of human performance.
The circles denote the measures obtained with PHC/SC, and the crosses
denote human performance. Distinct colors represent different measures (minimum
in red, median in blue, maximum omitted for space reasons). The median
performance of PHC (resp. SC) stays within or below one std of the median human
performance (i.e., blue circles within or below blue bars) in 9 (resp. 12)
videos.

Table 2. Error means over individual performance.

Case          GCE'_min  GCE'_med  GCE'_max
human/human   0.0168    0.0563    0.1436
PHC/human     0.0333    0.0941    0.1827
SC/human      0.0308    0.0932    0.1870

7 Conclusions 

We presented a methodology to benchmark scene structuring algorithms in home 
videos, using human performance on the task as the baseline. The agreement 
measures, adapted from work on natural image segmentation, attempt to model 
two concepts in perceptual organization. On a small but diverse data set, our 
experiments suggest that there exists human agreement in terms of organization 
of video scenes, but that there is a considerable variation w.r.t. scene granularity, 
which seems to depend on the visual content complexity. The comparison of two 
techniques with our methodology suggested that both performed similarly well, 
but still not as well as people. A comprehensive study that compares other 
agreement measures [9,4] and structuring algorithms remains as a future goal. 

Acknowledgements. We thank the Swiss NCCR on Interactive Multimodal 
Information Management IM2 for support, and Eastman Kodak for the home 

Abstract. Colonoscopy is an important screening procedure for colorectal
cancer. During this procedure, the endoscopist visually inspects
the colon. Currently, there is no content-based analysis and retrieval 
system that automatically analyzes videos captured from colonoscopic 
procedures and provides a user-friendly and efficient access to important 
content. Such a system will be valuable as an educational resource for en- 
doscopic research, a platform to assess procedural skills for endoscopists, 
and a platform for mining for unknown abnormality patterns that may 
lead to colorectal cancer. The first necessary step for the analysis is pars- 
ing for semantic units. In this paper, we propose a new visual model ap- 
proach that employs visual features extracted directly from compressed 
videos together with audio analysis to discover important semantic units 
called scenes. Our experimental results show average precision and recall 
of 93% and 85%, respectively. 



1 Introduction 

Colorectal cancer is the second leading cause of all cancer deaths behind lung 
cancer in the United States [1]. As the name implies, colorectal cancers are 
malignant tumors that develop in the colon and rectum. The survival rate is 
higher if the cancer is found and treated early before metastasis to lymph nodes 
or other organs occurs. Colonoscopy allows for inspection of the entire colon 
and provides the ability to perform a number of therapeutic operations during a 
single procedure. During a colonoscopic procedure, a tiny video camera at the tip 
of the endoscope generates a video signal of the internal mucosa of the colon. The 
video data are displayed on a monitor for real-time analysis by the endoscopist. 
The video data are not typically captured for post review or analysis. We call 
videos captured during colonoscopic procedures colonoscopy videos. 

To the best of our knowledge, there is no content-based retrieval system 
that automatically analyzes colonoscopy videos and provides a user-friendly and 
efficient access to important content. Such a system will be valuable as an important
educational resource for endoscopic research, a platform to assess procedural
skills for endoscopists, and a platform for mining for unknown abnormality
patterns that may lead to colorectal cancer. Colonoscopy videos have unique 
characteristics, rendering known definitions of semantic units such as shots and 
scenes inapplicable. New definitions are required. Colonoscopy videos contain 
many blurry frames due to frequent shifts of camera focus while moving along 
the colon. Current endoscopes are equipped with a single, wide-angle lens that 
cannot be focused. Sharpness, brightness and contrast of the image therefore are 
optimized using the endoscopist’s skills. 

We have recently developed a new framework for parsing colonoscopy 
videos [2]. The framework includes a new scene definition and a novel audio- 
based scene segmentation algorithm. In this paper, we introduce a new visual 
model that captures a special kind of a cut-like and fade-like pattern appearing 
frequently around scene boundaries of colonoscopy videos. The pattern corre- 
sponds to the endoscopist’s action of searching for the next anatomic landmark 
in the colon. Our new segmentation algorithm employs visual analysis in com- 
pressed domain based on the visual model together with our audio-based scene 
segmentation. 

The remainder of this paper is organized as follows. Section 2 briefly provides 
background on our audio-based scene segmentation and related work. We present 
our new scene segmentation algorithm in Section 3. Experimental results are 
discussed in Section 4. Finally, we offer our concluding remarks in Section 5. 



2 Background and Related Work 

We briefly summarize our work on audio-based scene segmentation for 
colonoscopy videos. To the best of our knowledge, no visual analysis for scene 
segmentation on colonoscopy videos has been investigated. Our observation on 
these videos suggests that typical scene segmentation techniques using visual 
properties developed for produced videos (e.g., sports, news clips) are not ap- 
plicable. However, hard cut and fade detection techniques are relevant and are 
summarized here. 



2.1 Audio-Based Scene Segmentation for Colonoscopy Videos 

The colon is a hollow, muscular tube about 6 feet long as illustrated in Fig. 1. 
A normal colon consists of six important parts: cecum with appendix, ascending 
colon, transverse colon, descending colon, sigmoid and rectum. A colonoscopic 
procedure consists of two phases: insertion phase and withdrawal phase. During 
the insertion phase, the endoscopist rapidly advances the tip of the endoscope to 
the most proximal location possible (cecum or terminal ileum). Careful mucosal 
examination, biopsy and therapeutic operations are typically performed during 
the withdrawal phase when the endoscope is gradually withdrawn. 

Fig. 1. The colon endoscopic segments: 1-cecum, 2-ascending colon, 3-transverse colon,
4-descending colon, 5-sigmoid, 6-rectum

Fig. 2. Scenes of a colonoscopy video



A scene is defined as a segment of visual and audio data that corresponds
to an important part of the colon. Since a typical colon has six different parts
and the terminal ileum is also reachable during endoscopy, a total of thirteen
scenes are expected in a complete colonoscopy (see Fig. 2).

Our colonoscopy videos include the video signal from the endoscopy unit and 
the endoscopist’s dictation when the tip of the endoscope is moving from one 
colonic segment into the next. The endoscopist speaks pre-defined terms during 
the colonoscopic procedure to indicate the current position of the video cam- 
era. Examples of these terms are “Entering rectum”, “Leaving rectum, entering 
sigmoid”, “Leaving sigmoid, entering descending colon”, etc. In addition, the
endoscopist may say terms indicating abnormality such as polyps and cancer. 
No patient identifiable information is included in the videos. Our audio-based 
scene segmentation algorithm works as follows. 

First, audio frames are classified into four types: Silence, Marker, Speech, and
Background. Marker indicates a special sound pattern of the change in the micro- 
phone status. To determine the type of each frame, a threshold-based algorithm 
using Short-Time Energy, Zero-Crossing Rate, Pitch, and Spectrum Flux is ap- 
plied. Only the audio frames of the speech type are given to speech recognition 
software that outputs the corresponding text transcript. 

We identify the name of each scene and associated boundaries as follows. We 
classify recognized words in the transcript into six categories: Location, Action, 
Position, Abnormal, Error, and Unused. The location category includes words 
describing important anatomic landmarks of the colon such as cecum and ter- 
minal ileum. The action category includes words indicating the action of the 
endoscopist such as “entering”. The position category has words indicating the 
camera position such as “begin” and “end”. The abnormal category has terms
indicating abnormality such as “polyp” and “cancer”. The unused category includes
non-communicative words such as “uh”. Speech segments that cannot
be recognized are assigned to the error category. A finite state automaton that
recognizes the regular expression (Action ∨ Position)* · Location is used to
recognize words spoken at scene boundaries.
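
A two-state automaton for this expression can be sketched as follows. The mini-lexicon is hypothetical and contains only the example terms quoted in the text; the function names are also illustrative, and the helper that scans a longer utterance for every match is an assumption about how the automaton would be applied to a transcript.

```python
ACTION, POSITION, LOCATION, UNUSED = "Action", "Position", "Location", "Unused"

# Hypothetical mini-lexicon built from the example terms quoted in the text.
LEXICON = {
    "entering": ACTION, "leaving": ACTION,
    "begin": POSITION, "end": POSITION,
    "rectum": LOCATION, "sigmoid": LOCATION, "cecum": LOCATION,
    "uh": UNUSED,
}


def accepts(categories):
    """Automaton for (Action | Position)* followed by exactly one Location."""
    accepted = False
    for category in categories:
        if accepted:
            return False  # nothing may follow the Location
        if category == LOCATION:
            accepted = True
        elif category not in (ACTION, POSITION):
            return False
    return accepted


def boundary_phrases(words):
    """Scan a word list and return every phrase that the automaton accepts."""
    phrases, prefix = [], []
    for word in words:
        category = LEXICON.get(word, UNUSED)
        if category in (ACTION, POSITION):
            prefix.append(word)
        elif category == LOCATION:
            phrases.append(prefix + [word])
            prefix = []
        else:
            prefix = []  # any other word breaks the current run
    return phrases


print(accepts([ACTION, LOCATION]))
print(boundary_phrases("leaving rectum entering sigmoid".split()))
```

An utterance such as "leaving rectum, entering sigmoid" therefore yields two boundary phrases, one for each location word.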

Based on the transcript and the timestamp of each speech segment, we obtain 
the scene boundaries as follows. Starting from the first speech segment, we locate 
the nearest speech segment with the same word in the location category (e.g., 
“rectum” in “entering rectum” and in “leaving rectum, entering sigmoid”). The 
starting time of the former speech segment and the ending time of the latter 
speech segment indicate the scene boundaries. 
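
The pairing of speech segments can be sketched as below. The sketch reads the description as follows: the first mention of a location opens a scene, and the next mention of the same location closes it. This reading, the data layout, and the timestamps in the example are assumptions made for illustration only.

```python
def scene_boundaries(segments):
    """Pair consecutive mentions of the same location word.

    `segments` is a time-ordered list of (start, end, locations) tuples, one
    per recognized speech segment. Returns (location, scene_start, scene_end)
    tuples: the start time of the earlier segment and the end time of the
    later segment that mention the same location.
    """
    open_scenes = {}
    scenes = []
    for start, end, locations in segments:
        for location in locations:
            if location in open_scenes:
                scenes.append((location, open_scenes.pop(location), end))
            else:
                open_scenes[location] = start
    return scenes


# Hypothetical timestamps (in seconds) for three speech segments.
segments = [
    (10.0, 11.5, ["rectum"]),                       # "entering rectum"
    (95.0, 98.0, ["rectum", "sigmoid"]),            # "leaving rectum, entering sigmoid"
    (240.0, 243.5, ["sigmoid", "descending colon"]),
]
scenes = scene_boundaries(segments)
print(scenes)
```

In this example the scene for "descending colon" stays open because its closing mention has not been observed; such a scene is the kind that would have to be recovered by other means.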

2.2 Hard Cut and Fade Detection Techniques for Produced Videos 

Hard Cut Detection: A hard cut is a direct concatenation of two shots, which
produces an abrupt temporal visual discontinuity in a video. Existing hard cut
detection techniques detect significant changes in either intensity/color histograms
[3,4,5], edge pixels [6], or motion [7] between consecutive frames.

Fade Detection: A production model of a fade sequence $S(x, y, t)$ of duration
$T$ is defined as the scaling of pixel intensities/color of a video sequence
$S_1(x, y, t)$ by a temporally monotone scaling function $f(t)$ as in Equation (1) [8].

$$S(x, y, t) = f(t) \times S_1(x, y, t), \quad t \in [0, T] \tag{1}$$

For a fade-in sequence, $f(0) = 0$ and $f(T) = 1$, while $f(0) = 1$ and $f(T) = 0$ for
a fade-out sequence. Typically, $f(t)$ is a linear function. It was observed that a
fade detector based on edge changes does not perform as well as a fade detector
based on changes in standard deviations of pixel intensities [9].
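
The link between the production model and the standard deviation of pixel intensities can be illustrated with a short sketch. It assumes a linear fade-in and, for simplicity, a static source frame (the same pixel values at every time step); the pixel values and function names are hypothetical. Under these assumptions the standard deviation of the faded frame is the scaling function times the standard deviation of the source frame, so it grows linearly during the fade.

```python
from statistics import pstdev


def fade_in_scale(t, duration):
    """Linear fade-in scaling function: 0 at t = 0 and 1 at t = duration."""
    return t / duration


def faded_frame(frame, scale):
    """Apply the production model to one frame given as a flat list of pixels."""
    return [scale * pixel for pixel in frame]


# Hypothetical static source frame, flattened to a list of intensities.
base_frame = [10, 50, 90, 130, 170, 210]
duration = 4

stds = [
    pstdev(faded_frame(base_frame, fade_in_scale(t, duration)))
    for t in range(duration + 1)
]
for t, value in enumerate(stds):
    print(t, round(value, 3))
```

A detector in the spirit of [9] would look for such a steady, monotone change in the standard deviation over consecutive frames.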

3 New Scene Segmentation Approach 

Our proposed scene segmentation consists of two steps. First, we apply our audio- 
based scene segmentation algorithm discussed in Section 2.1. However, some 
scenes may not be detected because the endoscopist’s speech is not recognized by 
the speech recognition software. Domain knowledge about the scenes in a typical 
colonoscopy video is helpful in identifying the names of the missing scenes, but 
is unable to identify the boundaries of these scenes. We apply visual analysis 
based on our new visual model to determine the missing scene boundaries. 

3.1 Visual Model for Scene Segmentation 

Based on our observations and consultations with our endoscopist, we identified
a specific pattern that appears at about 60% of the scene boundaries in
colonoscopy videos. We call this pattern the cornering pattern, as it corresponds
to the endoscopist’s action of steering the endoscope around the cornering parts
of the colon (i.e., cecum and terminal ileum, ascending and transverse colons,
transverse and descending colons, and descending and sigmoid colons).

Fig. 3. Cornering pattern around a scene boundary
[Figure: first sequence (images with recognized edges), second sequence (blurry
images), and third sequence (fade-in sequence), separated at t_1 and t_2]

The cornering pattern consists of three sequences of images (see Fig. 3). The
first sequence is composed of images with recognized edges. The second sequence
consists entirely of blurry images, i.e., images with unclear edges. The
transition between these two sequences is quite abrupt, like a hard cut in
produced videos. The third image sequence is like a fade-in sequence, with a
gradual increase in pixel intensities/color and edges. This sequence occurs as
the endoscopist starts to recognize some part of an anatomic landmark and
gradually adjusts the camera position to make the image clearer. Existing
production models [8,10] cannot capture the cornering pattern. Hence, we propose
a new visual model for this pattern. Let S_1(x,y,t), S_2(x,y,t), and S_3(x,y,t)
represent the first, the second, and the third image sequences, respectively.
The spatial dimensions are represented by x and y, and the temporal dimension is
represented by t. The cornering pattern S(x,y,t) is then defined in Equation (2).

    S(x, y, t) = (1 - H(t - t_1)) \times S_1(x, y, t)
               + H(t - t_1) \times (1 - H(t - t_2)) \times S_2(x, y, t)
               + H(t - t_2) \times f(t - t_2) \times S_3(x, y, t)          (2)

where t_1 denotes the timestamp of the first frame after the first sequence and
t_2 is the timestamp of the first frame after the second sequence (see Fig. 3).
H(t) is a function that outputs 1 when t > 0 and 0 otherwise. When t < 0, f(t)
produces zero; otherwise, f(t) is a temporal scaling function. Unlike in the
production model for a typical fade sequence, this function is typically not
linear.
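
As an illustration only, Equation (2) can be evaluated frame by frame as in the
sketch below. Frames are flat lists of pixel values, H follows the definition
in the text (1 only when t > 0), and example_scaling is a hypothetical
non-linear f chosen for the demonstration; the paper does not specify its
shape. The function and variable names are assumptions.

```python
def heaviside(t):
    """H(t): 1 when t > 0 and 0 otherwise, as defined in the text."""
    return 1.0 if t > 0 else 0.0

def example_scaling(dt, rise=4):
    """Hypothetical non-linear f: zero for dt < 0, quadratic rise to 1."""
    if dt < 0:
        return 0.0
    return min(1.0, (dt / rise) ** 2)

def cornering_frame(t, s1, s2, s3, t1, t2, f):
    """Evaluate Equation (2) at frame index t.

    s1, s2, s3 are the three image sequences, indexed by frame; each frame
    is a flat list of pixel values.
    """
    w1 = 1.0 - heaviside(t - t1)
    w2 = heaviside(t - t1) * (1.0 - heaviside(t - t2))
    w3 = heaviside(t - t2) * f(t - t2)
    return [w1 * a + w2 * b + w3 * c
            for a, b, c in zip(s1[t], s2[t], s3[t])]
```

With constant sequences, the output equals S_1 up to t_1, S_2 up to t_2, and
a scaled S_3 afterwards, so exactly one term of Equation (2) is active for
each frame.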

3.2 Feature Extraction and Analysis 

Since our colonoscopy videos are already encoded in MPEG-2, we extract visual 
features directly from the compressed videos to reduce the segmentation time. 
We first obtain a DC-image from the Y-color plane (intensity) of each frame 
using the technique in [11]. A DC-image is a spatially reduced version of the 
original image. We compute the standard deviation of DCT coefficients in each 
DC-image. This is based on our observation that the distribution of the standard 
deviations of the DC images in the cornering pattern often follows the pattern in 
Fig. 4. That is, the standard deviation of each DC-image in the second sequence 
is generally small and smaller than those of the frames in the other two sequences. 
We call the second sequence the monotone sequence.

Fig. 4. Pattern of standard deviations of DC images in the cornering pattern

We observe that the standard deviations of the frames in the fade-in sequence
can be modeled using a curve fitting method. We choose a linear regression model
to describe the standard deviations of the frames in the third sequence by one
or more linear functions. Hence, the scaling function f(t) in Equation (2) may
be a combination of one or more linear functions. The challenge is to find the
ending frame of each linear curve automatically.
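
The per-frame feature can be sketched as follows. This is a simplified
stand-in, not the compressed-domain technique of [11]: it assumes the
luminance plane is already decoded and approximates the DC-image by averaging
each 8 x 8 block, which is proportional to the block's DC coefficient. The
names dc_image and dc_std are hypothetical.

```python
from statistics import pstdev

def dc_image(luma, block=8):
    """Average each block x block tile of a luminance plane (list of rows)."""
    rows = len(luma) // block
    cols = len(luma[0]) // block
    dc = []
    for r in range(rows):
        dc_row = []
        for c in range(cols):
            tile = [luma[r * block + i][c * block + j]
                    for i in range(block) for j in range(block)]
            dc_row.append(sum(tile) / len(tile))
        dc.append(dc_row)
    return dc

def dc_std(luma, block=8):
    """Standard deviation of the values in the DC-image of one frame."""
    dc = dc_image(luma, block)
    return pstdev([v for row in dc for v in row])
```

A featureless (blurry or uniformly dark) frame gives a value near zero, which
is the property the monotone sequence relies on.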



3.3 Scene Boundary Detection Algorithm 

Step 1: Preprocessing: Since more than 99% of the scene boundaries fall
within speech segments, we restrict visual analysis to the video segments
corresponding to the endoscopist’s speech segments, excluding those that
contain a keyword in the abnormal category. These segments are excluded
because the terms in this category are very specific and irrelevant to scene
boundaries. Next, we apply a filter that removes the black area (the area with
DC coefficients below a threshold) surrounding the useful portion of the image
(see Fig. 5).





Fig. 5. Captured image and image after removal of the black surrounding region 

Step 2: Detection of a monotone sequence: A sequence of consecutive 
frames is declared as a monotone sequence if it has at least a pre-defined 
minimum number of consecutive frames with the standard deviation of each 
of these frames below a monotone threshold. 
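
A minimal sketch of this step is given below. It assumes the per-frame
standard deviations are already available as a list; the function name and
the sample threshold values are hypothetical, since the paper obtains its
parameters from the training set and does not list them.

```python
def find_monotone_sequences(stds, monotone_threshold, min_length):
    """Return (start, end) frame index pairs, end inclusive, for every run of
    at least min_length consecutive frames whose standard deviation is below
    monotone_threshold."""
    runs, start = [], None
    for i, value in enumerate(stds):
        if value < monotone_threshold:
            if start is None:
                start = i            # a new candidate run begins here
        elif start is not None:
            if i - start >= min_length:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(stds) - start >= min_length:
        runs.append((start, len(stds) - 1))
    return runs
```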

Step 3: Hard Cut Detection: To check for a discontinuity between the first
sequence and the monotone sequence, we use a sliding window of 2w + 1
consecutive frames. We first position the center of the sliding window at the
frame immediately before the first frame in the monotone sequence. We derive
a sequence of bin-wise histogram differences between the DC-images of each
pair of consecutive frames in the window. We declare a hard cut at the center
of the sliding window if the histogram difference of the two consecutive frames
at the center is the largest within the window, and the ratio between the
largest difference and the second largest difference in the window is larger
than a predefined hard-cut ratio. If a cut is not found, we slide the window
away from the monotone sequence by one frame. The same process is repeated
until a cut is found or a given number of frames before the monotone
sequence have been checked. In the latter case, no hard cut is detected.
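
The sliding-window test can be sketched as below. Several details are
assumptions rather than statements from the paper: diffs[i] is taken to be
the histogram difference between frames i and i + 1, the "difference at the
center" is read as diffs[center], the number of histogram bins is arbitrary,
and w is assumed to be at least 1. All names are hypothetical.

```python
def histogram(values, bins=64, max_value=256):
    """Histogram of DC values in [0, max_value)."""
    hist = [0] * bins
    for v in values:
        hist[min(bins - 1, int(v * bins / max_value))] += 1
    return hist

def histogram_difference(img_a, img_b, bins=64):
    """Sum of absolute bin-wise differences; images are flat lists of DC values."""
    ha, hb = histogram(img_a, bins), histogram(img_b, bins)
    return sum(abs(a - b) for a, b in zip(ha, hb))

def find_hard_cut(diffs, monotone_start, w, cut_ratio, max_shift):
    """Return c such that a hard cut lies between frames c and c + 1, or None.

    diffs is a list; monotone_start is the first frame of the monotone sequence.
    """
    for shift in range(max_shift):
        center = monotone_start - 1 - shift
        lo, hi = center - w, center + w      # differences diffs[lo] .. diffs[hi - 1]
        if lo < 0 or hi > len(diffs):
            break
        window = diffs[lo:hi]
        others = window[:w] + window[w + 1:]  # everything except the center
        second = max(others)
        # Comparing with a product avoids dividing by a zero difference.
        if diffs[center] > second and diffs[center] > cut_ratio * second:
            return center
    return None
```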

Step 4: Detection of a fade-in sequence: We check whether two linear 
curves fit well with the standard deviations of the coefficients of DC-images 
after the monotone sequence using the algorithm in Fig. 6. 



/* Let σ_i be the standard deviation of the coefficients in the DC-image of frame i */
e := frame ID of the last frame in the monotone sequence
i := 0; c := 0;
repeat
    n := 2;  /* consider the ending frame of the previous sequence and n
                subsequent frames */
    repeat   /* correlation coefficient value is in the range [0, 1] */
        r_1 is a correlation coefficient of σ_e, ..., σ_{e+n}
        r_2 is a correlation coefficient of σ_e, ..., σ_{e+n+1}
        n := n + 1;
    while r_1 - r_2 < (0.05 · r_1)  /* the change in correlation values is small */
    if r_1 > 0.8 then c := c + 1;   /* a linear curve fits well with the values */
    i := i + 1; e := e + n;
while i < 2;
if c = 2, a fade-in sequence is detected



Fig. 6. Fade-in sequence detector for a cornering pattern 
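
The detector in Fig. 6 can be transcribed into a runnable sketch under a few
assumptions that the figure leaves open: the correlation coefficient is taken
to be the absolute Pearson correlation between frame offset and standard
deviation (so that it lies in [0, 1]), a constant run is given a correlation
of 0, and the detector reports failure when it runs out of frames. The
function names are hypothetical.

```python
from math import sqrt

def line_fit_quality(values):
    """Absolute Pearson correlation between frame offsets 0..n-1 and values."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    syy = sum((y - mean_y) ** 2 for y in values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    if sxx == 0 or syy == 0:
        return 0.0                      # assumption: constant runs do not fit
    return abs(sxy) / sqrt(sxx * syy)

def detect_fade_in(sigma, e, curves=2, min_corr=0.8, tolerance=0.05):
    """sigma[i]: standard deviation for frame i; e: last monotone frame."""
    good_fits = 0
    for _ in range(curves):
        n = 2
        while True:
            if e + n + 1 >= len(sigma):  # guard added for this sketch
                return False
            r1 = line_fit_quality(sigma[e:e + n + 1])
            r2 = line_fit_quality(sigma[e:e + n + 2])
            n += 1
            if not (r1 - r2 < tolerance * r1):
                break                    # adding a frame hurt the fit noticeably
        if r1 > min_corr:
            good_fits += 1
        e += n
    return good_fits == curves
```

The 0.05 and 0.8 constants are the ones shown in Fig. 6; the number of curves
defaults to two, matching the model selected in Section 4.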

Step 5: Boundary Identification: If both a monotone sequence and a fade- 
in sequence are detected, the scene boundary is declared at the first frame 
after the ending frame of the fade-in sequence. However, if a hard cut and 
a monotone sequence are detected without the fade-in sequence, we declare 
the scene boundary at the hard-cut location. 
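
The decision rule of Step 5 amounts to a small case analysis, sketched below
with hypothetical argument names. Frame indices are used for the detected
positions and None marks a component that was not detected.

```python
def scene_boundary(monotone_found, fade_in_end, hard_cut_frame):
    """Combine the detector outputs of Steps 2-4 into a boundary frame.

    fade_in_end: index of the ending frame of the fade-in sequence, or None.
    hard_cut_frame: index of the detected hard cut, or None.
    """
    if not monotone_found:
        return None
    if fade_in_end is not None:
        return fade_in_end + 1      # first frame after the fade-in sequence
    if hard_cut_frame is not None:
        return hard_cut_frame       # fall back to the hard-cut location
    return None
```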

4 Performance Study 

We first obtain appropriate parameter values for the proposed visual analysis
technique using a training set of ten colonoscopy videos. Like scene boundaries
for produced videos, scene boundaries for colonoscopy videos are subjective.
Since the movement of the camera is slow, the anatomic landmark used for scene
boundary identification by the endoscopist appears in several frames. Therefore,
if the software-detected boundary is within two seconds of the boundary
determined by the endoscopist, we treat the detected boundary as a correct
boundary. A detected scene is considered correct if both of the detected
boundaries of the scene are correct. We measure the following performance metrics.



    Recall = \frac{\text{Number of correctly detected scenes}}{\text{Number of actual scenes}}

    Precision = \frac{\text{Number of correctly detected scenes}}{\text{Number of detected scenes}}
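
The evaluation protocol can be sketched as follows. Scenes are given as
(start, end) pairs in seconds, and the two-second tolerance is treated as
inclusive, which is an assumption; the one-to-one matching of detected scenes
to actual scenes is likewise a choice made for this sketch. The function
names are hypothetical.

```python
def boundary_is_correct(detected, actual, tolerance=2.0):
    """A detected boundary is correct if it is within tolerance seconds."""
    return abs(detected - actual) <= tolerance

def evaluate_scenes(detected, actual, tolerance=2.0):
    """Return (precision, recall) for lists of (start_sec, end_sec) scenes."""
    unmatched = list(actual)
    correct = 0
    for d_start, d_end in detected:
        for scene in unmatched:
            if (boundary_is_correct(d_start, scene[0], tolerance)
                    and boundary_is_correct(d_end, scene[1], tolerance)):
                correct += 1
                unmatched.remove(scene)   # each actual scene is matched once
                break
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall
```

For example, with three actual scenes and two detected scenes of which one
has both boundaries within tolerance, the precision is 1/2 and the recall is
1/3.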



Table 1 shows the performance comparison of the fade-in detector using one 
linear curve (“Model 1”), two linear curves (“Model 2”), and three linear curves 
(“Model 3”). The fade-in detector with two linear curves produces the highest 
precision and recall and is used in the subsequent performance comparison. 



Table 1. Effect of fade detection models on the training set

             Model 1   Model 2   Model 3
Precision    0.91      0.94      0.86
Recall       0.62      0.78      0.71



Table 2. Precision and recall of three scene segmentation algorithms

                   Precision            Recall               Time (sec.)
ID       Length    AO    AV-C  AV-U     AO    AV-C  AV-U     AV-C   AV-U    AV-C/AV-U
         (min)
001      18:26     0.89  0.90  0.91     0.62  0.69  0.77     2597   7150    0.36
007      25:08     0.90  0.91  0.91     0.69  0.77  0.77     3600   10588   0.34
009      37:22     0.91  0.85  0.85     0.77  0.85  0.85     5310   13973   0.38
010      34:24     1.00  1.00  1.00     0.85  0.85  0.85     4905   14420   0.34
014      36:33     0.91  0.77  0.85     0.77  0.77  0.85     5199   14442   0.36
015      23:00     1.00  1.00  1.00     1.00  1.00  1.00     3317   8965    0.37
017      21:14     0.90  0.90  0.90     0.69  0.69  0.69     3029   8413    0.36
019      24:05     0.90  0.92  1.00     0.69  0.85  0.92     3466   9903    0.35
020      13:07     1.00  1.00  1.00     1.00  1.00  1.00     1800   4800    0.38
047      28:29     0.90  0.82  0.82     0.69  0.69  0.69     4037   11214   0.37
062      30:34     0.83  0.77  0.77     0.77  0.77  0.77     4328   11697   0.37
133      33:02     1.00  1.00  1.00     0.85  1.00  1.00     4762   12806   0.37
148      24:28     1.00  1.00  1.00     0.92  0.92  0.92     3460   9582    0.36
152      11:55     1.00  1.00  1.00     0.92  1.00  1.00     1587   4376    0.36
163      19:34     1.00  1.00  1.00     0.92  1.00  1.00     2742   7374    0.37
177      21:29     0.89  0.89  0.89     0.62  0.62  0.62     3031   8156    0.37
179      29:15     1.00  1.00  1.00     0.69  0.77  0.77     4184   11252   0.37
185      21:34     1.00  1.00  1.00     0.92  0.92  0.92     3049   8168    0.37
190      27:07     0.91  0.92  1.00     0.77  0.92  1.00     3896   10477   0.37
197      14:54     1.00  1.00  1.00     0.85  0.85  0.85     2020   5437    0.37
Average  24:44     0.95  0.93  0.95     0.81  0.85  0.86     3516   9560    0.36

4.1 Performance Comparison

Given the best parameter values obtained via experiments with the training set, 
Table 2 shows the performance comparison among our audio-based segmenta- 
tion (AO), our model approach using features in compressed domain (AV-C), 
and our model approach using features derived from pixel intensities of uncom- 
pressed videos (AV-U) on twenty colonoscopy videos not included in the train- 
ing set. Both model-based approaches outperform the audio-based technique in 
terms of recall. AV-C and AV-U found 11 and 15 correct scenes missed by AO, 
respectively. AV-U performs slightly better than AV-C because AV-U can bet- 
ter detect the boundaries of the terminal ileum scene. On average, AV-C takes 
only about a third of the time taken by AV-U to segment a video on the same 
machine. Hence, a hybrid approach using AV-U for detecting boundaries of the 
terminal ileum scene and AV-C for the other scenes should give the best recall 
and precision with the segmentation time in between that of AV-C and AV-U. 



5 Concluding Remarks

We have presented a scene segmentation technique for videos generated from 
colonoscopic procedures. The technique employs audio analysis and a new visual 
analysis method based on our visual model for colonoscopy videos. Experiments 
on real colonoscopy videos show average precision and recall of 93% and 85%, 
respectively. Future work includes detection of benign and malignant lesions,
video annotation, video summarization, and video browsing and retrieval.



References 

1. Greenlee, R., Murray, T., Bolden, S., Wingo, P.A.: Cancer statistics. CA Cancer J
Clin 50 (2000) 7-33

2. Cao, Y., Tavanapong, W., Kim, K.H., Wong, J., Oh, J.H., de Groen, P.C.: A 
framework for parsing colonoscopy videos for semantic units. In: To appear in 
Proc. of Int’l Conf. on Multimedia and Expo, Taipei, Taiwan (2004) 

3. Gargi, U., Kasturi, R., Strayer, S.H.: Performance characterization of
video-shot-change detection methods. IEEE Transactions on Circuits and Systems
for Video Technology 10 (2000) 1-13

4. Yusoff, Y., Kittler, J.: Video shot cut detection using adaptive thresholding. In: 
Proc. of the British Machine Vision Conference, Bristol, UK (2000) 

5. Naphade, M.R., Mehrotra, R., Ferman, A.M., Warnick, J., Huang, T.S., Tekalp, 
A.M.: A high-performance shot boundary detection algorithm using multiple cues. 
In: Proc. of the IEEE Int’l Conf. on Image Processing, Chicago, Illinois, USA 
(1998) 884 - 887 

6. Zabih, R., Miller, J., Mai, K.: A feature-based algorithm for detecting and
classifying production effects. Multimedia Systems 7 (1999) 119-128

7. Hanjalic, A., Zhang, H.J.: Optimal shot boundary detection based on robust sta- 
tistical models. In: Proc. of the IEEE Int’l Conf. Multimedia Computing and 
Systems, Florence, Italy (1999) 







8. Hampapur, A., Jain, R., Weymouth, T.: Production model based digital video 
segmentation. Multimedia Tools and Applications 1 (1995) 9-46 

9. Lienhart, R.: Comparison of automatic shot boundary detection algorithms. In: 
Proc. of SPIE Storage and Retrieval for Still Image and Video Databases VII. 
Volume 3972. (1999) 290 - 301 

10. Truong, B.T., Dorai, C., Venkatesh, S.: New enhancements to cut, fade, and dis- 
solve detection processes in video segmentation. In: Proc. of ACM Multimedia, 
Los Angeles, CA, USA (2000) 219-227 

11. Yeo, B.L., Liu, B.: Rapid scene analysis on compressed video. IEEE Transactions 
on Circuits and Systems for Video Technology 5 (1995) 533-544 




Video Summarization and Retrieval System Using 
Face Recognition and MPEG-7 Descriptors 



Jae-Ho Lee and Whoi-Yul Kim 

Image Engineering Laboratory, Division of Electrical and Computer Engineering, 
Hanyang University, Seoul, Korea 

jhlee@vision.hanyang.ac.kr, wykim@hanyang.ac.kr
http://vision.hanyang.ac.kr



Abstract. In this paper, we introduce an automatic video summarizing and
indexing tool for a personal video recorder. The tool utilizes MPEG-7 visual
descriptors to generate a video index for summary. The resulting index not
only generates a preview of a movie but also allows non-linear access with
thumbnails. In addition, the index supports searching for shots similar to a
desired one within saved video sequences. Moreover, a face recognition
technique is utilized for person-based video summarization and indexing of
stored video data.



1 Introduction 

The popularization of digital broadcasting brings us into contact with a large
amount of video data. The sheer amount of data is becoming increasingly difficult to
handle on conventional home electronics. Utilization of MPEG-7 is a reasonable
approach to describing and managing multimedia data [1]. To this end, there has been
some research on the use of MPEG-7 in broadcasting content applications. A.
Yamada et al. built a visual program navigation system that uses the MPEG-7
color layout descriptor [2]. N. Fatemi and O.A. Khaled designed a retrieval
application using a rich news description model based on the MPEG-7 standard [3],
and T. Walker proposed a system for content-based navigation of television programs
based on MPEG-7 [4]. The system that Walker presented uses standard MPEG-7
description schemes (DS) to describe television programs, but is limited to news
programs. T. Sikora applied MPEG-7 descriptors to the management of multimedia
databases [5]. Also, A. Divakaran et al. presented a video summarization technique
using cumulative motion activity based on compressed domain features extracted
from motion vectors [6]. The MPEG-7 standard is composed of seven main parts [7].
The visual component is categorized by basic structures such as color, texture, shape,
motion, localization and face recognition. A detailed explanation of all visual
descriptors can be found in the standard document [8].

In this paper, we introduce a video summarizing system that generates summarized
video efficiently. The summary information generated by the presented tool provides
users with an overview of video content and guides them visually so that they can
move quickly to a desired position in a video. The tool also makes it possible to find
shots similar to a queried shot or face, which is useful for editing different video
streams into a new one. Here, only MPEG-7 visual descriptors are used to segment a
video into shots, to summarize a video, and to retrieve a scene of interest. This work
is an extension of our previous research [18].



2 Implementation of Summarizing System 

The developed system provides four main functions:

1. Generation of an overview of video contents, to find a desired position in video
data.

2. Query by example or face, to find shots similar to a queried one in a large amount
of video data.

3. Nonlinear editing, to support simple editing based on the summarized video.

4. Actor-based indexing, to find scenes that contain a queried person in a video.

Each image in the summary index is a representative frame of a cluster,
which is composed of a number of shot segments. Users can find shots similar to a
favorite shot by querying with an example in the index. Also, a new video
stream can be easily generated by moving shot segments from the summary index.
All of these operations have been kept simple so that they can be executed with a
home electronics interface. The functions implemented in the developed tool are
described in the following subsections.

2.1 Video Summarization 

With our tool, a video summary is generated in three ways: semantic-based, content- 
based, and face-based summarization. Semantic summarization is for videos that have 
stories such as dramas or sitcoms, while content-based summarization is for videos 
that do not contain stories such as sports videos. In both cases, a video stream is seg- 
mented into a set of shots as a preprocessing step by detecting scene changes. The 
adaptive threshold and gradual transition detection methods are applied in this step 
[9][10]. In this stage, three MPEG-7 color descriptors: color layout, dominant color, 
and color structure are computed and saved as the features to be utilized in the sum- 
marization and retrieval process. 

Abrupt scene change detection is done simply by computing the distance between
two sets of features extracted from adjacent frames. Since the number of detected
abrupt scene changes is highly dependent upon the threshold value, an adaptive
threshold based on the average and deviation over a local duration can be used [9].
To detect a gradual change, the distance between the current frame and the frame
k frames away is computed by Bescos’s plateau method [10]. To detect the exact
location of the plateau form, metrics such as symmetry, slope fail (distance
decreasing on the rising phase, or vice versa), the maximum distance, and the
distance difference when a scene change occurs are used. Values of 10, 20, 30, and
40 were chosen as k to cover the various durations of gradual changes. The Color
Layout Descriptor (CLD) was used as the feature for the distance computation to
detect these abrupt and gradual scene changes.
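
The adaptive thresholding idea can be illustrated with a minimal sketch. The
exact rule of [9] is not given in the text, so the rule used here (mean plus
k times the standard deviation of the neighbouring distances in a local
window) is an assumption, as are the function name and the default
parameters. The input is a list of feature distances between adjacent frames,
for example distances between their color layout descriptors.

```python
from statistics import mean, pstdev

def detect_abrupt_changes(distances, window=4, k=3.0):
    """Flag index i when distances[i] exceeds a locally adaptive threshold.

    distances[i] is the feature distance between frames i and i + 1 (a list).
    The threshold is computed from up to `window` neighbours on each side,
    excluding the candidate itself.
    """
    cuts = []
    for i, d in enumerate(distances):
        lo, hi = max(0, i - window), min(len(distances), i + window + 1)
        neighbours = distances[lo:i] + distances[i + 1:hi]
        if not neighbours:
            continue
        if d > mean(neighbours) + k * pstdev(neighbours):
            cuts.append(i)
    return cuts
```

Excluding the candidate from its own statistics keeps a single large
distance from raising the threshold that it is compared against.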
For key-frame extraction from the segmented video shots, several approaches have
been proposed recently. H. Chang, S. Sull and S. Lee presented a method to measure
the performance of key-frames [11]. Although this method is not infallible, it is
applicable to the results presented in this paper and would allow them to be
compared with those of other techniques. A. Hanjalic and H. Zhang provided a
more thorough review of key-frame extraction techniques, as well as another way
to assess the performance of a key-frame extraction scheme using
cluster-validity analysis [12]. In this paper, the oldest approach to automatic
key-frame extraction was adopted [13]: it chooses as the key frame the frame
appearing after each detected shot boundary, which is appropriate for a PVR
that receives video data via a broadcasting system.

After the scene change detection step, a video summarization process follows. To
perform semantic summarization, the segmented shots are clustered to compose
story units. To obtain content-based summarization, clustering is applied while
considering the duration of each shot segment. In both cases, the distance between
shots is measured using the MPEG-7 color layout descriptor and color structure
descriptor.

Shot clustering is performed by comparing the key frames from the scene change
detection step using a modified time-constrained method. Time-constrained
clustering is based on the observation that similar shot segments separated in time
are likely to belong to different story units. A time window is therefore used
so that temporally distant shots are not placed in the same story unit, even though
they have similar features. A hierarchical clustering method merges shot segments that
have similar features and neighbor each other in the time domain into the same
cluster. The time window for comparison is fixed at 3000 seconds. Yeung
proposed a Scene Transition Graph (STG) to generate story units from clustering results
[14]. Each node of the STG is a cluster, and links between clusters are generated
when they contain adjacent shot segments. To separate story units, the following
observations were used: (1) shots in the same unit interact with each other, and
(2) there is no interaction with shots in other story units except for one transition
between units. Based on these observations, the cut edge, which is a one-directional
path between units, is identified. In our system, we selected a simple numbering
method to detect the transition point. The pseudo code of the method is shown below:

Story_0 ∋ Cluster_0, lastCID = 0, j = 0
for i = 0 ~ Number of Shots
    if (CID of shot_i > lastCID)
    {
        j++
        Story_j ∋ CID of shot_i
        lastCID = CID of shot_i
    }
    else if (CID of shot_i < CID of shot_{i-1})
    {
        merge Story_{k+1}, Story_{k+2}, ... , Story_j
        where Story_k contains the cluster to which shot_i belongs
    }

If a new cluster ID appears while the cluster IDs of consecutive shots are checked,
the shot is regarded as the start of a new story unit. However, when an interaction
is detected, all related story units are merged. Semantic summarization has a
hierarchical structure derived by considering the temporal locality and continuity
of content in video. In this system, we assume that the story is the top layer of
the structure, and each story is organized into clusters. Each cluster in turn holds
key frames of the scene. Figure 1 describes the hierarchical structure of the
semantic summarization.
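
The numbering method in the pseudo code can be sketched in runnable form as
follows. Two points are assumptions of this sketch rather than statements of
the paper: cluster IDs are numbered in order of first appearance, and the
merged story units Story_{k+1}, ..., Story_j are folded into Story_k, after
which Story_k becomes the current unit. The function name is hypothetical.

```python
def group_story_units(cluster_ids):
    """Group clusters into story units.

    cluster_ids[i] is the cluster ID of shot i, with IDs numbered in order of
    first appearance. Returns a list of sets of cluster IDs, one per story.
    """
    if not cluster_ids:
        return []
    stories = [{cluster_ids[0]}]
    last_cid = cluster_ids[0]
    for prev_cid, cid in zip(cluster_ids, cluster_ids[1:]):
        if cid > last_cid:
            # Unseen cluster: tentatively open a new story unit.
            stories.append({cid})
            last_cid = cid
        elif cid < prev_cid:
            # The video returns to an earlier cluster: an interaction.
            k = next(i for i, story in enumerate(stories) if cid in story)
            merged = set().union(*stories[k:])
            stories = stories[:k] + [merged]
    return stories
```

For the cluster sequence 0, 1, 0, 1, 2, 3, 2, 4, the shots alternating between
clusters 0 and 1 form one story, those alternating between 2 and 3 form a
second, and cluster 4 opens a third.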







Fig. 1. Hierarchical structure of semantic summary 



The bottom images represent the key frame of each shot segment. By comparing
their features, all shot segments are grouped into the next level of the hierarchical
structure, which is called the cluster. The story units, each holding one or more
clusters, form the highest level of the structure.

Content-based summarization focuses on the similarity of content without
temporal information. This method is especially applicable to sports videos. For
example, in a soccer video, player scenes, goal scenes, and audience scenes can all be
classified separately. Figure 2 shows an example of a content-based summarization
result.




Fig. 2. Content-based video summary

In the content-based summary, there is no hierarchical structure. The far left
column depicts a video sequence arriving in the system, and the content-based summary
results are displayed on the right side. Each cluster has key frames with similar
features, even if they are located separately in time.

Face detection and recognition techniques can also be used for summarizing
video data. This summarization scheme allows users to access or retrieve
content efficiently, especially in dramas. Figure 3 shows the result of face-based
summarization of video data.







Fig. 3. The face based summarizing results. 



The PCA/LDA face recognition method was adopted in the recognition process [20],
and Haar-like features were used in the face detection stage to achieve pseudo
real-time processing [19].

2.2 Video Segment Editing 

Video editing based on the summarized index is also supported by the developed
system. In the editing procedure, the segmented shots can be removed or
merged into a new video stream by the user. Users can generate a video stream that
consists of their favorite shots using this editing tool. Each of these functions is
designed to be used easily with a remote control. The user chooses a shot
segment, cluster, or story in the video indexes using the remote control, and editing
only requires pushing the insert and move buttons. With this simple process, a user
can easily compose his or her favorite scenes.

2.3 Video Retrieval

Similar scenes can be queried with the developed system. The MPEG-7
descriptors are used to find similar scenes with the query-by-example method. This
function plays an important role in user convenience by enabling quick searching
for favorite scenes for editing or direct access. If a user invokes the retrieval
function by clicking the query image in one index, the similar key frames are
retrieved across all saved summary indexes.
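
Query by example can be sketched as a nearest-neighbour ranking over stored
key-frame descriptors. The sketch below is generic: it uses a plain Euclidean
distance over hypothetical feature vectors and does not reproduce the
descriptor-specific matching functions of the MPEG-7 XM software. The
function name and the sample index are assumptions.

```python
from math import dist

def query_by_example(query_vec, index, top_k=3):
    """Rank stored key frames by distance to the query descriptor.

    index maps a key-frame ID to its descriptor vector; the top_k closest
    entries are returned as (key-frame ID, distance) pairs.
    """
    ranked = sorted((dist(query_vec, vec), key) for key, vec in index.items())
    return [(key, d) for d, key in ranked[:top_k]]

# Hypothetical two-dimensional descriptors for four stored key frames.
index = {"a": [0, 0], "b": [3, 4], "c": [1, 0], "d": [10, 10]}
results = query_by_example([0, 0], index)
```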



3 Experimental Results 

The MPEG-7 visual descriptors are utilized in the developed system to keep up with
the further extension of home entertainment systems using Internet connectivity. The
detailed algorithms for extracting visual features and for measuring the similarity
between features follow the XM document and software of MPEG-7 [15][16]. To
analyze the performance of scene change detection with the visual
descriptors, two trailers and two music videos were selected as test video data.
The detection of abrupt scene changes is usually easier than that of gradual changes.
We therefore chose this test data because each video includes many gradual scene
changes. Table 1 lists the number of scene changes in each video, and Table 2 shows
the detection results. According to the results, the MPEG-7 color
descriptors can be utilized as features for scene change detection.



Table 1. Information of videos in experiments

No   No. of frames   No. of abrupt changes   No. of gradual transitions
1    4005            43                      29
2    3616            55                      22
3    5552            0                       34
4    6858            78                      37



Table 2. The accuracy rate of scene change detection

     Abrupt change                 Gradual transition
No   Recall (%)   Precision (%)    Recall (%)   Precision (%)
1    93           98               81           88
2    94           98               88           65
3    -            -                87           77
4    95           97               81           89



An example of the scene change detection and key frame extraction for a movie is
displayed in Figure 4. The first frame of each shot is selected as the key frame. In
the experimental results, the number of clusters was on average about 30 percent of
the total number of segmented shots. After the clustering procedure, all the clusters
are gathered and classified into story units. The number of story units is usually
about 15 percent of the number of clusters. The generated shots, clusters and story
units form a hierarchical structure that can assist the consumer in viewing and
accessing the video content easily. To analyze summarization efficiency, six types
of video data were used. The summarized results are shown in Table 3.

Fig. 4. Example result of a video summary



Table 3. Experimental results on video summarization

Video         Duration   No. of shots   No. of clusters   No. of stories
Comedy show   45m 15s    757            279               34
Drama         19m 46s    174            47                6
Animation     27m 27s    630            267               14
Music show    83m 13s    1141           428               43
News          60m 36s    1066           712               66
Movie         39m 21s    472            167               12



The number of stories depends on the characteristics of the video data. Video
data with short shots, such as news, music shows, and comedy shows, yields many
shots, clusters, and stories in the summarized index. On the other hand, drama,
animation, and movie data have a small number of stories because they contain long
and similar video content.

The retrieval result for the queried key frame is presented in Figure 5(a). The color
descriptors are also utilized here. Combining multiple features can generate more
accurate retrieval results. A detailed performance analysis of the compound color
descriptors is given in [17].

The image on the top left in Figure 5(a) is the query image. The similar frames are
displayed in a window as the retrieval result. A user can directly access and edit
this result. Figure 5(b) shows an example of an edited scene built from three
individual dramas. The generated scene includes only one actress, according to the
user’s choice. The new scene can also be saved for reproduction. The top left-hand
side shows the list of opened indexes, and the right-hand side shows the summarized
index of the selected files. The two bottom rows are the edited scenes and generated
scenes that have been selected by the user. The newly generated scene is displayed
in the middle.

Fig. 5. (a) The retrieval results in key frames, (b) The results of new video segment with three
different video clips



4 Conclusions 

The objective of this research is to develop a summarizing system using only the
visual descriptors in Part 3 of the MPEG-7 standard, without any human intervention
or manual annotation. In this paper, we have introduced our video summarizing tool.
The resulting tool enables users to access a video easily through a generated
summarization index. In addition, the summarization index supports other operations
that can be helpful and interesting for users: querying a scene and editing a video
stream. The MPEG-7 descriptors are used to obtain the video summarization and to
retrieve a queried scene. The proposed tool was devised to be operated inexpensively
with a simple interface; therefore, it can be embedded in a PVR. Furthermore, the
system can be extended for a
video search engine for Internet-connected PVRs using the MPEG-7 technique.



Abstract. This paper proposes a method of automatically generating a
video digest that can dynamically change the scene component ratio in
relation to context semantics and to individual distinctive events. We
implemented a system for generating tennis digests, where three parameters
given by the user, i.e., the player being focused on, the digest composition,
and the total duration, are reflected in the content. We asked several people
with TV production experience to create correct digest data for reference,
i.e., the digests that should be generated when the same parameters are given.
The reference data and the output of the system were compared to evaluate
the effectiveness of the proposed method. The results revealed that
the generated digests could convey the semantics of the original video
reasonably well, and they demonstrated the performance and the validity
of our approach.



1 Introduction 

Recently, the amount of visual information available has been rapidly increasing 
across various fields. Video summarization is expected to become increasingly 
important, considering its capability of enabling information to be accessed more 
efficiently and important segments or highlights to be browsed from the entire 
content within a limited time. 

The previous approaches to video summarization can be classified into two groups.

The first has mainly focused on automatically extracting low-level features 
from various media, such as color, texture, camera motion, facial characteristics, 
captions, sound classification, and TF-IDF for transcripts. It identifies important 
scenes by using a combination of these features and their transitions [1]-[4]. Many
examples have been reported where it has been applied to video that has a 
comparatively simple structure, such as news videos. A common drawback has 
been that it becomes difficult to identify specific semantic content such as what 
happens in each individual scene, since the method is based on low-level features. 

The second has involved several studies that allow considerable
manual input of essential data, in which indices related to semantic content have
been designed, generated, and applied so that they are easily manageable [5]-[6].
Once the indices are manually obtained, flexible summarization can be achieved
according to various requests, because indices representing context flow and
distinctive events are available. In many cases, though, refinement by a person is
costly and time-consuming, and so automatic indexing remains an open issue.

This paper proposes a method that can automatically generate a digest that
includes the semantic content of the original video and adapts to the user’s
preferences, by focusing on context flow and distinctive events. We implemented a
system for generating digests from real tennis footage. Although this paper is
based on an approach using indices related to semantic content, we introduced
domain knowledge and human-action analysis, and limited the kinds of indices,
as much as possible, to those that could be obtained by automatic analysis.

The rest of the paper is organized as follows. Section 2 presents the require- 
ments for the summarized video and its indexing process. Section 3 describes the 
process of identifying important scenes from the indices obtained in section 2 and 
the process of generating the summarized video and accompanying text adapted 
to user’s preferences. Several experimental results are presented in section 4, and 
the conclusion is summarized in section 5. 

2 Requirements for Digest and Acquisition of Semantics 

The following four items are considered to be the requirements to generate a 
digest in this paper. The digest should: 

1. express the flow of the whole match, 

2. be able to display each memorable, distinctive scene, 

3. be able to dynamically reconfigure the content adapting it to the user’s 
preferences, and 

4. be generated through internal representation, such as indices, obtained as 
automatically as possible. 

The indices used in this paper are as follows:

- score information $P$,

- individual events during play $A$.

First, scoring information is extracted from the score region in the video.
Analysis methods that can identify the meaning of the telop (on-screen caption)
region in the video and associate it with other media have been previously
reported [7]. However, since the test video used was the direct output from the
camera without any editing, the scoring information, including the start and end
times, was manually prepared in this paper.

Several studies have reported indexing methods for detecting individual
distinctive moments during play, especially for sports videos [8]-[11]. In
this paper, an approach based on domain knowledge and human-action analysis
is introduced, since the indices obtained by this approach are expected to be
more flexibly associated with the semantic content [10].

Two kinds of indices are automatically extracted, as listed in Table 1:
play event $A_t$, which represents the temporal order of each player’s actions, and
$A_n$, which indicates distinctive events, such as excellent shots by each player [11].
For example, the event “serving” is identified by the following conditions:
both players “stay” at the “backout court” at a certain time, followed by
either player doing an “overhead swing” at the “backout court”.
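
The serving rule can be read as a check over a time-ordered sequence of per-player observations. The following is a minimal sketch, assuming that an upstream analysis stage already provides each player's court region and action label at every time step; the `Observation` type, its fields, and the exact label strings are hypothetical names chosen for illustration, not the authors' data format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Observation:
    """Hypothetical per-time-step state of both players (labels are assumptions)."""
    time: float
    regions: tuple   # court region of (player 0, player 1)
    actions: tuple   # action label of (player 0, player 1)


def detect_serving(track: List[Observation]) -> Optional[float]:
    """Return the time of the first "serving" event, or None if there is none.

    Rule from the text: both players "stay" at the "backout court" at some
    time, followed by either player doing an "overhead swing" there.
    Simplification: once both players have been seen waiting, the flag is
    never reset, so any later overhead swing at the backout court matches.
    """
    both_ready = False
    for obs in track:
        if both_ready:
            for region, action in zip(obs.regions, obs.actions):
                if region == "backout court" and action == "overhead swing":
                    return obs.time
        if all(r == "backout court" for r in obs.regions) and all(
            a == "stay" for a in obs.actions
        ):
            both_ready = True
    return None
```

A real detector would probably also bound the time gap between the waiting state and the swing; that bound is not given in the text, so it is left out here.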



Table 1. Two kinds of play events used

ID   Play event representing             ID   Play event representing
     temporal order of player’s action        player’s best action
0    forehand stroke                     0    service ace
1    backhand stroke                     1    double fault
2    forehand volley                     2    serve & volley
3    backhand volley                     3    stroke ace
4    smash                               4    smash success
5    serving                             5    smash failure
-    -                                   6    passing success
-    -                                   7    passing failure



3 Generating Digest 

Figure 1 shows a block diagram of the digest generation. 

The input data consists of score data $P$, video data $V$, player’s basic events
in time order $A_t$, player’s best events $A_n$, text elements $T_e$, and user’s input $I$.



3.1 Generating Structured Video Data 

First, structured data M is generated that hierarchically describes the sets, 
games, and points for the whole match, the serving player, the point-winning 
player, basic events of each player, etc. This helps to understand each player’s 
actions and the status of the match at any given time. 

3.2 Acquisition of Context Development 

Then, using M, internal representation R is generated that describes the overall 
development of the match. In this paper, the development of the match was 
analyzed using the value of the player’s superiority and its changes during the 
match. 

Superiority value $s$ of a set (or game) indicates how much either player
dominated the set (or game), and is calculated as $s = d_i / d_{\max}$, where

$d_i$ = (actual difference between players in games/points before the last
game/point was won in the set/game)

$d_{\max}$ = (maximum possible difference between players in games/points before
the last game/point was won in the set/game)
Fig. 1. Block diagram for generating digest (inputs: $P$, $V$, $A_t$, $A_n$, $T_e$, and $I = (f_s, f_p, T_t)$)

Fig. 2. Overview of match development through superiority value $S$



For example, the superiority value of a set with the game score 6-2 is
calculated as $s_s = 3/5 = 0.6$. Likewise, the superiority value of a game won
after the second deuce can be obtained as $s_g = -1/5 = -0.2$, as the point
count is 40-A (in other words, 4-5 in the number of points won). Here, the
minus sign shows that the opposite player was dominant during the game. If
$|s_s| < s_{s\_th1}$, the match was recognized as “close”; if $|s_s| > s_{s\_th2}$, as
“one-sided”; and if $s_{s\_th1} < |s_s| < s_{s\_th2}$, as “smooth”. Currently, the
thresholds are set as follows: $s_{s\_th1} = 0.2$, $s_{s\_th2} = 0.6$. Likewise,
$s_{g\_th1} = 0.25$, $s_{g\_th2} = 0.6$.

Let us introduce another superiority value $S$ obtained by accumulating the
superiority value for each game $s_g$ ($s_s$ may be used, but $s_g$ was selected here
because $s_s$ provides a resolution that is too low): $S = \sum_g s_g$.
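
The superiority computation can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and because the text leaves the case where |s| equals a threshold undefined, the sketch assumes that the upper boundary counts as "one-sided" and the lower boundary as "smooth".

```python
from itertools import accumulate
from typing import List


def superiority(d_i: int, d_max: int) -> float:
    """s = d_i / d_max; the sign tells which player dominated."""
    return d_i / d_max


def classify(s: float, th1: float, th2: float) -> str:
    """Label a set or game from |s| and the two thresholds.

    Assumption: the text does not say what happens when |s| equals a
    threshold; here |s| == th2 is "one-sided" and |s| == th1 is "smooth".
    """
    a = abs(s)
    if a < th1:
        return "close"
    if a >= th2:
        return "one-sided"
    return "smooth"


def cumulative_superiority(game_values: List[float]) -> List[float]:
    """Running sum S of the per-game values s_g, one entry per game played."""
    return list(accumulate(game_values))


s_set = superiority(3, 5)    # set won 6-2: the score was 5-2 before the last game
s_game = superiority(-1, 5)  # game won after the second deuce: 4-5 in points
print(s_set, classify(s_set, 0.2, 0.6))
print(s_game, classify(s_game, 0.25, 0.6))
```

Plotting the output of `cumulative_superiority` against the game index gives a curve of the kind described for Fig. 2: a sustained slope in one direction indicates a one-sided stretch, and a flat stretch indicates balance.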

As shown in figure 2, S can be interpreted as an indicator representing the 
flow of the match until a certain time point. Match 1 shows that the player 
corresponding to the minus side strengthened his/her hand in the first half, and 
ended up in a completely one-sided development for him/her in the second half. 
Likewise, match 2 shows that the balance was maintained in the first half, that
the plus player temporarily got on top until the minus player regained
momentum in the middle, and that the plus player then strengthened his/her hand
in the second half.

In summary, internal representation $R$ describing the development of the
match is obtained by determining the superiority values $s_s$, $s_g$, and $S$,
labeling the development of the game in each set as “one-sided”, “close”, or
“smooth”, and specifying whether these values were maintained or changed during
the course of the match.

3.3 Generating Surface Sentence 

Surface sentence $T$ representing the output narration text is generated using $R$,
narration text element $T_e$, and user input $I$.

Here, $T_e$ denotes the collection of nouns, verbs, adjectives, adverbs, etc. that
are sufficient to describe match flow and various player actions. It currently
consists of simple tables made up of IDs and various text elements.

For example, if the development flow for the first set was one-sided, the
following output can be obtained as a surface sentence by referring to $T_e$:
“Player Oka won the first set with ease by 6-2.”

The following is an example related to the best actions by the players:
“He won with great passing shots and a strong serve-and-volley game.”

In this paper, two kinds of surface sentence are generated: $T_c$, which relates
to the flow of the match represented by $R$, and $T_n$, which relates to the best
actions by the players.

In addition, while generating surface sentences, user input $f_p$ is considered
and text elements related to the focus player are prioritized. The focus player
is the person or team the user likes, and can be selected or ignored in generating
the digest depending on preference. Therefore, by using $f_p$, the candidate
sentences related to the specified player or team can be selected preferentially in
the generated narration text.



3.4 Acquisition of Video Segments Corresponding to Surface 
Sentence 

Video segments $V_c$, $V_n$ typically representing each generated sentence $T_c$, $T_n$ are
obtained from the relevant range.

In this paper, video segments related to the match flow were chosen from the
scenes that included the last hitting event by the player for the last point of a
game from each set, and from the scenes that included the most frequent events
in every game determined as “close” or “one-sided”.

Video segments related to important events by players were chosen by ordering
the best events of the players, such as passing shots and service aces,
beforehand and by selecting the events that were highest in rank, after considering
the development flow in the match and the focus player $f_p$.

The rank represents a value related to the surface sentence $T$ and indicates
the degree of importance of the text elements within the whole context. Generally,
a surface sentence $T$ has text elements $T_i$, and each $T_i$ is of different importance
depending on the content. For example, subjects and verbs are higher in rank
as these are essential to a sentence. Likewise, modifiers are lower in rank.



3.5 Determining the Digest 

Finally, $T'_c$ and $T'_n$ are chosen in descending rank order, to fall in the range of
a user-specified duration $T_t$. At the same time, the corresponding $V'_c$ and $V'_n$ are
determined. The process stops when the sum of the durations of the selected
video segments becomes larger than $T_t$. Finally, the digest $S$ can be obtained
after concatenation into sequential order.

Here, another user-input value $f_s$, the digest composition, also affects the
rankings. The digest composition specifies whether the user wants to see a digest
based on match development, a digest focusing on best events such as
fantastic shots, or a digest that includes both elements. Depending
on whether the content of element $T_i$ is related to match flow or to best events,
the weight of the rank attached to element $T_i$ changes, which also varies the temporal
portion of each sentence and video segment to be used in the final digest.
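
The selection step can be sketched as a greedy pick over weighted ranks. The sketch below is an illustration only: the `Candidate` fields and the weight table `WEIGHTS` are hypothetical (the paper does not give the actual weights), and it assumes that selection stops just before the duration budget would be exceeded, which is one possible reading of the stopping rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """A candidate sentence with its video segment (all fields illustrative)."""
    text: str
    kind: str        # "flow" (match development) or "best" (best events)
    rank: float      # importance of the element within the whole context
    start: float     # start time of the video segment in the match (sec)
    duration: float  # length of the video segment (sec)


# Hypothetical weights per digest composition f_s: (flow weight, best weight).
WEIGHTS = {0: (1.0, 0.5), 1: (0.5, 1.0), 2: (1.0, 1.0)}


def select_digest(cands: List[Candidate], f_s: int, t_total: float) -> List[Candidate]:
    """Pick candidates in descending weighted rank within the duration budget."""
    w_flow, w_best = WEIGHTS[f_s]

    def weighted(c: Candidate) -> float:
        return c.rank * (w_flow if c.kind == "flow" else w_best)

    chosen, used = [], 0.0
    for c in sorted(cands, key=weighted, reverse=True):
        # Assumption: stop before the budget T_t would be exceeded.
        if used + c.duration > t_total:
            break
        chosen.append(c)
        used += c.duration
    # Concatenate in the order the segments occurred in the match.
    return sorted(chosen, key=lambda c: c.start)
```

Changing `f_s` reorders the candidates before the budget is applied, which is how the same match can yield a flow-oriented or an event-oriented digest of the same length.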

4 Experimental Results 

4.1 Results by Proposed Method 

Digest video and narration text were generated using several combinations of
parameters: content composition $f_s$ = {0: focus on match flow, 1: focus on best
events, 2: both}, focus player $f_p$ = {0: player A, 1: player B, 2: both}, and
total time for the digest $T_t$ = {15, 30, 60 (sec)}. Narration
texts were generated by piecing together the candidate sentences and filling in
numerical values, etc. Each candidate sentence had a corresponding video
segment.

In the following example, a generated digest is shown with these parameters: 
content composition = 0, focus player = Oka , and total time = 30 (sec). 

“In the first set, ”, “Oka dominated in a one-sided set, ”, “player Oka won the 
first set with ease by 6-2. ”, “In the second set, though Hinomura temporarily got 
on top,”, “it became a close match, and”, “player Oka escaped by 6-).” 

Another example of a generated digest is shown below with the parameters: 
content composition = 1, focus player = Oka, and total time = 30 (sec). 

“In the first set, Oka won with a series of passing shots”, “and excellent 
serve-and-volley play, and maintained his lead throughout the set, ”, “player Oka 
won the first set with ease by 6-2. ”, “In the second set, Oka won with a service 
ace”, “and stroke ace,”, “player Oka escaped by 6-).” 

Another example is shown below with the parameters: content composition 
= 1, focus player = Hinomura, and total time = 30 (sec). 

“In the first set, though Hinomura made some excellent shots with great serve- 
and-volley play”, “and several service aces, ”, “player Hinomura lost the first set 
with ease by 2-6. ”, “In the second set, although Hinomura made some excellent 
shots with great passing shots”, “and several stroke aces,”, “player Hinomura 
was beaten by f-6. ” 



4.2 Results by People with TV Production Experience

In the proposed method, the three parameters given by a user, i.e. focus player, 
digest composition, and total duration, are reflected in the content of the gener- 
ated digest. We asked several people having TV production experience to create 
correct digest data for reference, which should be generated when the same pa- 
rameters are given. Table 2 shows the overview of generating digest for reference. 
Table 2. Overview of generating digest for reference

item            content
creator         5 people with TV production experience including sports
test sequence   3 matches of unedited material video
                2 matches of edited broadcast video
parameters      $f_s$ = {0,1}, $f_p$ = {0,1,2}, $T_t$ = {15,30,60}



450 digests composed of video and narration were generated using 18 com- 
binations of parameters for each creator and for each test sequence. 
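
The counts of 18 combinations and 450 digests can be checked by enumeration. The sketch below assumes that the two-valued parameter in Table 2 is the digest composition and the three-valued one is the focus player; the variable names are illustrative.

```python
from itertools import product

f_s_values = (0, 1)         # digest composition
f_p_values = (0, 1, 2)      # focus player
t_t_values = (15, 30, 60)   # total duration (sec)

combinations = list(product(f_s_values, f_p_values, t_t_values))
creators, sequences = 5, 5  # from Table 2: 5 creators, 3 + 2 test sequences

total_digests = len(combinations) * creators * sequences
print(len(combinations), total_digests)   # 18 450
```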

In the following example, a generated digest is shown with these parameters: 
content composition = 0, focus player = Oka , and total time = 30 (sec). 

“In today’s friendly match men’s singles, with the opponent Hinomura,”,
“Oka aggressively took the net from the early stage.”, “After he broke the fifth
game of the first set with his persistent play,”, “Oka on the bandwagon took the
first set on his pace.”, “In the second set, though temporarily pushed into a corner
by Hinomura,”, “Oka steadily piled up the points and won the match straight by
2-0.”

Another example of a generated digest is shown below with the parameters: 
content composition = 1, focus player = Oka, and total time = 30 (sec). 

“In today’s friendly match men’s singles, with the opponent Hinomura,”,
“Oka showed fast-moving games with his serve-and-volley play,”, “winning all
the credit on service,”, “and return ace.”, “When the opponent took the net, Oka
hit through the cross”, “and straight shots.”, “Oka maintaining his pace through
the match”, “and won straight by 2-0.”

Another example is shown below with the parameters: content composition 
= 1, focus player = Hinomura, and total time = 30 (sec). 

“In today’s friendly match men’s singles, with the opponent Oka,”,
“Hinomura showed fast-moving games,”, “scoring points after points by taking the
net.”, “He put on a sharp performance”, “in serving and returning,”, “making
fun of Oka with his outstanding shots.”, “However, affected by missed shots at
the important places”, “Hinomura lost out the match by 0-2.”

4.3 Comparison and Discussion 

First, let us confirm the qualitative properties of the correct digests generated
for reference. Correct narrations had certain structural patterns. Namely, the
first sentence introduced the match and the players, the last sentence summarized
the result, and the sentences between them described the developments and
highlights during the match, demonstrating the logical process of introduction,
development, turn and conclusion. This trend was common regardless of test
sequences and creators.

Table 3 shows words with a high frequency of use in the first, intermediate,
and last sentences of the correct data. It shows that there is a certain
correlation between the kinds of words used and the logical structure of
introduction, development, turn and conclusion.

Also, it indicates that the intermediate sentences tend to include words
representing the match development when $f_s = 0$, whereas when $f_s = 1$, they tend to
include words representing super plays.



Table 3. Words with high frequency of use in correct narration for reference

sentence       $f_s = 0$                 $f_s = 1$
first          match, singles,           match, singles,
               friendly, open,           friendly, women,
               women, men, Japan,        versus, open, men,
               mix, doubles              Japan, mix
intermediate   set, game, win,           set, service, shot,
               service, however,         however, take, game,
               form, point, lead,        play, ace, opponent,
               drop, miss, play,         straight, volley,
               take, shot, become,       service, stroke, win,
               in, finally, net          score, net-play, -hand
last           set, do, lose,            set, victory, lose,
               straight, victory,        match, do, count,
               win, count, finally,      straight, take,
               brilliant, take           brilliant, player



The following similarities and differences were confirmed by comparing the
digest generated by the proposed method with the correct data for reference.

1. The generated digest represented the overall development of the match, but
did not contain the introduction and closing remarks as a summary of the
match. These elements can be generated by using score data and metadata
representing the match and the players.

2. The generated digest favorably reconfigured the content for match
development and individual best events, corresponding to the values of $f_s$, $f_p$, $T_t$.

3. Only limited expressions were used in the generated narrations. The
narrations gave an artificial impression compared to the lively and vivid
ones used in the correct data for reference. The generated narrations need
to be improved, for instance by introducing natural language processing
technology.

4. Some parts of the scenes used in the generated digests were the same as in
the correct data for reference, such as the last play of each set, aces, etc.,
but there were many exceptions. Further comprehensive evaluations through
subjective tests would be necessary.
Topics that remain for future work are: fine-tuning the digest generating
process, improving the method of evaluating the generated digest, and verifying 
the effects by using the generated digest for more subjects. 

Figure 3 shows an example of playing back the generated digest by the pro- 
posed method. 




Fig. 3. Synchronized playback of generated digest video and narration text 



5 Conclusion 

We proposed a method which can adaptively generate a digest, including the 
semantic content of the original video, depending on the user’s preferences, by 
focusing on context flow and distinctive events in the video. 

Indices for player’s basic actions and score information were generated using 
the player’s position, the ball position, and the time point of ball impact from 
video of actual tennis footage. The system was designed to capture the flow of 
the whole match and the memorable scenes such as great shots, as the essential 
components in the generated digest. 

The digests generated by the proposed method were compared with the correct
digests created by people with TV production experience. Fine-tuning the
digest generation process, improving the method of evaluating the generated
digest, and verifying the effects by using the generated digest for more subjects
remain as future work.
Abstract. One of the challenges in image and video retrieval is the content-based
retrieval of images and videos on the web. Little work has been done in
this area, mainly due to scalability issues. In this paper we
investigate this problem by presenting tools for the characterization of the
visual contents of specific web collections and a strategy for the search of faces
on the web using visual and text information. A case study in a
specific web domain is also presented.



1 Introduction 

Content-based image and video retrieval is a fast-growing and increasingly relevant
research area. The research community recognizes the following main challenges in
this field [8]: the bridging of the semantic gap (understanding the meaning behind the
query), the content-based retrieval of videos (finding a video similar to another one),
and the increasingly huge amount of digital data, produced by digital consumer devices
(e.g. digital cameras) and computational devices (hard disks, CD-ROMs, etc.), which
needs a semantic understanding and also produces a scale problem. In addition,
we believe that the content-based retrieval of images and videos on the web is an
important and challenging area where less research has been done, probably for
technical and practical reasons. As is well known, popular search engines allow the
retrieval of images on the web using only text queries. This situation should be
improved, and we think we should start to develop methods and strategies for the
content-based retrieval of information on the web, the largest and most used
multimedia database in the world.

The web is growing at an increasingly rapid pace. More importantly, faster 
computers and network connections are allowing creators of web content more 
freedom to add, with fewer constraints, larger quantities of images, graphics, and 
video. At the same time, people’s interest in using images from the web has also 
increased (the words pictures and pics are among the most queried terms). 
Furthermore, given the trend to enrich websites with multimedia, it becomes 
increasingly important to be able to characterize a given collection of the web 
according to the multimedia elements that it contains. This type of information is of 
great importance for Internet service providers (who can determine required levels of 
regional service), for content producers, and for web search application developers.
Characterizing the multimedia contents of the web, however, is a challenging 
technical problem. First, one must deal with huge amounts of distributed data. 
Second, it is necessary to use media-specific content-based analysis tools to be able to 
determine the content of the multimedia elements. With images and video, this means 
developing tools to automatically determine their visual characteristics: color, texture, 
shape, etc. More interestingly, it implies using algorithms to automatically detect 
objects of interest (e.g. persons). Obviously, given the large amounts of data, manual 
classification is not an option. 

In this context, this paper studies the content-based retrieval of images in specific
web collections (for practical reasons the whole web cannot be studied at the
moment), and also the characterization of the visual contents of these collections. To
this end we have developed tools for efficient web-crawling, content-based image
analysis (low-level features such as color, shape and texture), skin segmentation, face
detection and clustering of web pages using text information. For developing and
testing these tools we have analyzed more than 4 million web pages, processed more
than 383 thousand images (about 35 billion pixels!) and clustered the text of more
than two thousand web pages.

This article is structured as follows. Related work is presented in section 2. In 
section 3 a strategy for the content-based retrieval of faces using visual and text 
information is proposed, motivating our image characterization results. Tools for 
processing and analyzing the images of a web collection are described in section 4. In 
section 5 we present a characterization of the image contents of the .CL domain as a 
case study. Finally, we conclude in section 6. 



2 Related Work 

The content-based retrieval of images and videos on the web is an underdeveloped
area. However, some preliminary work has been done. Two of the most important
early works are outlined here. A system for automatically indexing images
collected from the web is presented in [9]. Images are automatically collected and
assigned to categories based on the text surrounding the images. In addition, visual
features are extracted from the images to construct a search engine that allows
search by visual content. However, the content-based analysis performed in this work
is restricted to color histograms. A similar system, which in addition uses
automatic face detection to index images on the web, is implemented in [4]. This work
differs from ours in the specific processing tools being used (our skin and face
detection algorithms are much faster), and also in the fact that, to solve current
technical problems (bandwidth, response time, etc.), we split the retrieval process in
two: the off-line creation of the image database for a specific web collection and the
on-line retrieval of the images. Concerning web characterization, to our knowledge,
there have not been any studies of web content that use content-based features to
characterize the images on the web. In the work of [9], for example, over 500,000
images and videos were catalogued, but general statistics on the visual content of the
images in the entire collection (or a subset of the collection using a pre-defined
criterion such as our .CL domain) were not presented. Finally, the first version of our
characterization study was presented in a regional conference [5]. In that study only
83,000 images were employed, text information was not analyzed and no distinction was
made between home page images and inner page images. All this processing is performed
in the current version of this study, which cites the mentioned work for the low-level
feature image analysis.



3 Towards a Face Search Engine 

As established in the introduction, the content-based retrieval of images and videos on
the web is an important and challenging task that should be addressed. However, due
to technical limitations (bandwidth, storage capacity, processing time, etc.) a general
retrieval system of images for the web cannot be built at this time. This does not
mean that nothing can be done for the moment. On the contrary, this task can be
addressed in an incremental way. To start we propose: (i) to restrict the domain of
operation of the retrieval system to a certain web collection to build a vertical image
search engine. We have chosen to work on the .CL domain, whose characteristics and
dimensions allow the implementation of a prototype; (ii) to create an image database
where the search process will be carried out. This database (a cache of images) is
created off-line, using the crawling tool described in section 4.2, to solve the
problems related to the time required for gathering the images. Thus, the on-line
process of image retrieval is performed on this database; (iii) to filter the images
to be stored in the database according to the functionality of the retrieval system to
be built (to deal with the storage capacity limitations). In our case we want to
build a person search engine; this means that graphics, images not containing skin
and images not containing faces should be filtered out, in addition to repeated images
(all these filters are described in section 4); (iv) to process and label the images to
be stored in the database. In the case of the person search engine, this means storing
the position of the faces detected in each image and the web page class of the text
associated with this image (the page clustering algorithm employed is described in
section 4.6).

Webfaces, the person search engine under construction, is based on the use of face
and text information. To search for a given person, the user should provide a picture
of the person and optionally a related text (a group of keywords). The search system
will provide a set of database images where the person may be present, and a
confidence value for each image. The text information will be used to determine the
associated web page class and therefore to restrict the search process to a given
portion of the database (the set of images with this associated class). The face
contained in the provided picture is used to do a similarity search against all the
faces contained in the selected subset of the database using a face recognition
algorithm. Using this information, the text clustering information will be used to
recommend clusters related to the similar images as well as keywords that can help to
improve the query. All relevant subsystems of Webfaces are already built. We are
working on the implementation of fast similarity algorithms based on metric spaces and
on solving scale and orientation issues of the face recognition process, to finish the
integration of all the mentioned components.
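
The query flow described above (restrict by page class, then rank faces by similarity) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `FaceRecord` fields are not the Webfaces schema, the face descriptor is a plain feature vector, cosine similarity stands in for the unspecified face recognition algorithm and confidence value, and a linear scan stands in for the fast metric-space search the authors say they are still implementing.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class FaceRecord:
    """One stored face (fields are illustrative, not the Webfaces schema)."""
    image_url: str
    page_class: int              # cluster of the associated web page
    descriptor: Sequence[float]  # hypothetical face feature vector


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here as a stand-in confidence value."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(db: List[FaceRecord], query: Sequence[float],
           page_class: Optional[int] = None, top_k: int = 10) -> List[Tuple[str, float]]:
    """Rank stored faces by similarity to the query face.

    If a page class was derived from the optional keywords, only faces whose
    web page belongs to that class are compared.
    """
    subset = [r for r in db if page_class is None or r.page_class == page_class]
    scored = [(r.image_url, cosine(query, r.descriptor)) for r in subset]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Restricting the candidate set by page class before comparing faces keeps the number of similarity computations proportional to the size of one cluster rather than the whole database.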
4 Tools for Analyzing the Images of a Web Collection

4.1 Proposed Methodology 

For developing and testing the image retrieval and analysis tools, which are the same
tools employed for characterizing the contents of certain web collections, we
employed real web data (images and text) sampled from the .CL top-level domain (4
million web pages, 383,000 images and the text of 200,000 web pages containing the
selected images). The processes employed for obtaining and processing this data are:
(i) web-crawling for sampling a given web collection and obtaining image links
(I-URLs) and web pages associated with these images, i.e. the web pages where the
images are found (W-URLs); (ii) color, edge and texture low-level visual analysis for
characterizing different kinds of images and for constructing image filters, such as
photographs vs. graphics and indoor vs. outdoor; (iii) skin segmentation algorithms
for detecting image areas where humans and human-body parts are present; (iv) face
detection algorithms for detecting humans; and (v) tools for clustering web pages
using the text information associated with the processed and selected images.

4.2 Web-Crawling 

Our web-crawling architecture is based on a long-term schedule for collecting sites
and a short-term schedule that takes care of network politeness and use of resources
(CPU, bandwidth) [1]. First we obtain a list of the domains of interest (all the domains
registered under .CL) and then we use our crawler to obtain the web pages in each of
the selected domains. The next step consists of automatically extracting the links to
the images (I-URLs) and the links to the associated web pages (W-URLs). For
practical purposes (processing time and storage capacity) the total set of links is
sampled and a statistically representative subset of them is employed for developing
and testing the tools. The crawling of the .CL domain was performed in May 2003,
August 2003 and January 2004. Each time about 1.3 million web pages were analyzed,
and the downloaded images numbered 100,000 in May 2003, 83,000 in August 2003 and
200,000 in January 2004. Text information was processed only in January 2004, and
the total number of web pages downloaded for this processing was 200,000.

4.3 Low-Level Visual Analysis 

A set of 72 visual features that represent color, shape and texture was extracted (see 
feature description in [5]). Although some of these features are fairly simple, they are 
useful in giving a snapshot of the visual content of images in the web and in the 
construction of image filters. Using these basic features we built a photograph vs. 
graphics filter. This filter was implemented using a support vector machine classifier 
and 5 of the 72 features (aspect ratio, standard deviation of the R histogram, 
average of the S component, 90th percentile of the R histogram and the texture feature 
LD at 0°), which were automatically determined using forward selection [13]. The 
performance of the resulting classifier is 94.5%. 

4.4 Skin Segmentation 

This functionality was implemented using SkinDiff [7], a robust skin segmentation 
algorithm that uses neighborhood information. The class of each pixel is decided 
by a spatial diffusion process that employs context information. In this 
process a given pixel belongs to the skin class if and only if its Euclidean distance, 
calculated in a given color space, to a direct diffusion neighbor that already belongs 
to the skin class is smaller than a certain threshold (T_diff). The seeds of the diffusion 
process are pixels with a high probability of being skin, i.e. pixels whose skin probability 
is larger than a certain threshold (T_seed). The extent of the diffusion process is 
controlled using a third threshold (T_min), which defines the minimal probability 
allowed for a skin pixel. SkinDiff uses the RGB color space (images on the web 
normally use this color space) and a Mixture of Gaussians (MoG) model for determining 
the skin probabilities. For fast computation, the MoG is implemented using look-up 
tables (LUTs). It is not necessary to store the skin probabilities in the LUT, but only 
which of the following three situations applies: skin probability larger than 
T_seed, smaller than T_min, or in [T_min, T_seed]. Therefore, for each possible RGB 
combination, only 2 bits need to be stored. For a practical implementation of the 
LUTs, the colors in each channel are quantized to 64 levels. Using SkinDiff, a 320x280 
image is processed in about 0.2 seconds. 
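
The diffusion scheme described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the SkinDiff implementation of [7]: the function `toy_skin_prob` is a hypothetical single Gaussian standing in for the MoG model, the 4-neighborhood is an assumed reading of "direct diffusion neighbor", and all threshold values are invented for the example.

```python
import math
from collections import deque

BELOW_MIN, CANDIDATE, SEED = 0, 1, 2   # three states, so 2 bits per LUT entry suffice
LEVELS = 64                            # quantization levels per color channel


def toy_skin_prob(r, g, b):
    """Hypothetical stand-in for the MoG skin model: one isotropic Gaussian."""
    d2 = (r - 200.0) ** 2 + (g - 140.0) ** 2 + (b - 120.0) ** 2
    return math.exp(-d2 / (2 * 40.0 ** 2))


def build_lut(prob_fn, t_seed, t_min):
    """Store only the state of each quantized RGB triple, not its probability."""
    step = 256 // LEVELS
    lut = bytearray(LEVELS ** 3)
    for r in range(LEVELS):
        for g in range(LEVELS):
            for b in range(LEVELS):
                # Evaluate the model at the centre of the quantization cell.
                p = prob_fn(r * step + step / 2, g * step + step / 2, b * step + step / 2)
                state = SEED if p > t_seed else BELOW_MIN if p < t_min else CANDIDATE
                lut[(r * LEVELS + g) * LEVELS + b] = state
    return lut


def lookup(lut, pixel):
    """Quantize an 8-bit (R, G, B) pixel and read its state from the LUT."""
    r, g, b = (c * LEVELS // 256 for c in pixel)
    return lut[(r * LEVELS + g) * LEVELS + b]


def skin_diffusion(image, lut, t_diff):
    """image: list of rows of (R, G, B) tuples. Returns a boolean skin mask."""
    h, w = len(image), len(image[0])
    state = [[lookup(lut, px) for px in row] for row in image]
    skin = [[s == SEED for s in row] for row in state]
    queue = deque((y, x) for y in range(h) for x in range(w) if skin[y][x])
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if not (0 <= ny < h and 0 <= nx < w):
                continue
            if skin[ny][nx] or state[ny][nx] == BELOW_MIN:
                continue  # already skin, or probability below T_min
            # Grow only if the color is close to the neighboring skin pixel.
            if math.dist(image[ny][nx], image[y][x]) < t_diff:
                skin[ny][nx] = True
                queue.append((ny, nx))
    return skin
```

A candidate pixel is accepted only when it can be reached from a seed through a chain of similar colors, so an isolated skin-colored pixel behind a non-skin pixel stays unlabeled.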

4.5 Face Detection 

This algorithm detects frontal faces with small in-plane rotations. The detector 
is a cascade of filters, where each filter discards non-faces and lets face 
candidates pass to the next stage of the cascade. This architecture aims at fast 
detection, exploiting the fact that only a few faces are to be found in an image, while 
almost all of the image area corresponds to non-faces. Fast detection is achieved 
in two ways: (i) keeping the complexity of the first stages of the cascade low, and (ii) 
using simple rectangular features (the filters), which are quickly evaluated using a 
representation of the image called the integral image [12]. Each of the filters of the 
cascade is trained using the Adaboost classifier [12]. The images are analyzed using 
24x24 pixel windows. Each window of a color image is preprocessed 
(filtered) using the skin segmentation algorithm described in Section 4.4. The number of skin 
pixels in each window is counted, and if this number is smaller than 50% of the pixels 
of the window, the window is discarded; otherwise, it is processed further. With 
this procedure, face detection time was reduced by a factor of 2 and the number of 
false detections was reduced considerably, while the face detection rate increased. 
The increase in the detection rate was achieved by reducing the number of stages in 
the cascade when the detector was applied to color images (49 stages were used for 
gray-scale images, but only 42 for color images). Additionally, the cascade 
processing was complemented by a statistical classifier added in parallel at the end 
of the cascade. The idea behind this procedure is the following: when fewer stages of 
the cascade are used, the detection rate increases but the false detection rate 
also rises (recall that each cascade stage filters out non-face windows). On the other 
hand, a statistical classifier of face and non-face windows, implemented using color 
and texture low-level features, decreases the detection rate of the cascade, but also the 
false detection rate. A good compromise between the detection rate and the false 
detection rate can therefore be found by placing the statistical classifier at the end 
of the cascade. After many trials it was found that the best place to put the classifier 
was after stage number 35. The selected classifier was an SVM, and the low-level features 
determined using forward selection [13] were: average of the B channel, standard 
deviation of the G channel, average V component, number of colors greater than 2% 
of image area, 50th percentile of the G channel, 10th percentile of the B channel, and the 
number of edge pixels in 45° greater than the average edge of the window. Finally, 
the obtained detections (detected face regions) are fused to determine the size and 
position of the final detected faces. Overlapping detections are processed to filter out 
false detections and to merge correct ones. All detections are separated into disjoint 
sets using the heuristic described in [11]. 
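
As an illustration of the skin-based window pre-filter, the sketch below counts skin pixels per window with an integral image, so that each window costs four look-ups regardless of its size. It is a minimal sketch, not the detector itself: the function names and the `stride` parameter are assumptions made for the example, and the cascade that would process the surviving windows is omitted.

```python
def integral_image(mask):
    """Summed-area table with a zero border: ii[y][x] sums mask rows < y, cols < x."""
    h, w = len(mask), len(mask[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += mask[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, size):
    """Number of skin pixels in the size x size window at (top, left)."""
    bottom, right = top + size, left + size
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def candidate_windows(skin_mask, size=24, min_skin_ratio=0.5, stride=1):
    """Keep only windows whose skin pixel count reaches the required ratio."""
    ii = integral_image(skin_mask)
    h, w = len(skin_mask), len(skin_mask[0])
    needed = min_skin_ratio * size * size
    return [(top, left)
            for top in range(0, h - size + 1, stride)
            for left in range(0, w - size + 1, stride)
            if box_sum(ii, top, left, size) >= needed]
```

With the default arguments this corresponds to the 24x24 windows and the 50% skin criterion of the text; only the returned windows would be passed on to the cascade.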

4.6 Text Clustering 

Images on the web are inserted into web pages using the IMG html tag. The ALT 
attribute of the IMG tag allows a text alternative to the image to be specified, which is 
automatically displayed when the browser cannot display the image. Some images 
are included within a hypertext anchor: in this case an image may behave as a button 
linked to other documents or resources. The text in the ALT attribute, along with the 
text inside the hypertext anchors, are candidate descriptors for the image. However, 
only a small fraction of the images in our collection have such descriptions. 
Furthermore, the quality of these descriptions is low; many of them have few words, 
which sometimes refer to file names with hidden meanings. This motivated us to use 
the whole text in the web pages as the accompanying text for the images. The text in 
web pages gives us an approximate context for each image. We leave the discovery of 
better descriptors for images as future work. Such a task may consider 
heuristics for extracting data from anchor text, ALT tags, or other parts of the html 
page that includes the image. We ran a clustering process over the text of the web pages 
containing the images. Our goal is to discover clusters that define textual contexts for the images. 
Such clusters are the basis of our approach for integrating textual contexts into our 
image retrieval tool. Cluster centroids can be used to model textual contents, and 
user queries, specified as lists of terms, can be compared against the centroids to 
determine the relevant contexts users are searching for. When the cluster associated with 
a query is found, the search for images can be focused on the images of the cluster. 
The clustering is performed with an implementation of the k-means algorithm 
provided by the clustering toolkit CLUTO [3]. We used a k-means algorithm for its 
simplicity and low computational cost. In addition, it has proved to be very effective 
for clustering collections of documents [14]. 
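
The comparison of a query against cluster centroids can be sketched as below. This is a minimal illustration, not the CLUTO pipeline: raw term frequencies are assumed as weights (the weighting scheme is not specified here), and all function names are hypothetical.

```python
import math
from collections import Counter


def to_vector(terms):
    """Sparse term-frequency vector; another weighting could be substituted."""
    return dict(Counter(terms))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of the page vectors assigned to one cluster."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: weight / len(vectors) for term, weight in total.items()}


def best_cluster(query_terms, centroids):
    """Index of the centroid most similar to the query terms."""
    q = to_vector(query_terms)
    return max(range(len(centroids)), key=lambda i: cosine(q, centroids[i]))
```

Once `best_cluster` has selected a context, the image search can be restricted to the images found in the pages of that cluster.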



5 Case Study: Characterization of the Images of .CL Domain 

Owing to space limitations, in this section we present a very condensed 
part of our characterization results. The complete results, including histograms, 
graphics and a complete statistical analysis, can be found on our website [15]. 

5.1 Crawling 

Domains and pages. In the most recent official study of the .CL domain [2] almost 2 
million pages were found in 38,307 sites in 34,867 domains. Current estimates by 
TodoCL [10] indicate that the Chilean Web has 5 million pages ± 10% and that 
the number of sites and domains is 80,000 ± 10%. From the 3.5 million pages used for 
this final version of our study (we considered up to 4 levels of links), we obtained two 
samples: one of home pages and one of inner pages. 

Home Page Images. This collection was obtained from 36,455 home pages; from 
those home pages, 23,523 had objects or links to non-textual URLs. In total 338,963 
links were found, 208,066 of them unique. From the unique links, 60.0% were to GIF 
images, 26.8% to JPG images, 7.7% to Flash animations, 2.6% to style-sheets, and 
0.7% to PNG images; the rest were mostly to PDF or Word documents. The total 
number of GIF, JPG and PNG images was 183,669; from those, 100,000 were 
randomly selected. 

Inner Page Images. The sample of inner pages was obtained in 8 hours of crawling, 
with 443,000 pages downloaded. We discarded all the pages that were at depth greater 
than or equal to 5 in the websites, and all the pages without links to images, obtaining a 
sample of 311,589 pages. We believe that this sample is representative of what a user 
sees while browsing the web, and that using pages at deeper levels would bias the sample 
towards large, dynamic websites. These pages contained 9,148,115 links to images, 
of which only 926,781 (about 10%) were unique, a much smaller proportion than in the 
home page collection (about 61%). Our interpretation is that web site owners usually 
have a small set of images, which are repeated across their entire websites. From the 
unique links, 53.9% were to GIF images, 35.4% to JPG images, 2.8% to Flash animations, 
2.2% to style-sheets, and 0.8% to PNG images. There is a significant decrease in 
animations in the inner pages. The total number of GIF, JPG, and PNG images was 
842,902. From those, 100,000 were randomly selected. 
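
The bookkeeping behind these figures (deduplicating links, counting them by type and drawing a random sample of images) might look as follows. This is a hedged sketch rather than the crawler's code: classifying links by file extension, the function names and the fixed random seed are all assumptions made for the example.

```python
import posixpath
import random
from collections import Counter
from urllib.parse import urlparse

IMAGE_EXTS = {".gif", ".jpg", ".jpeg", ".png"}


def extension(url):
    """Lower-cased file extension of the URL path, ignoring any query string."""
    return posixpath.splitext(urlparse(url).path)[1].lower()


def sample_images(links, sample_size, seed=0):
    """Return the unique links, a count per extension, and a random image sample."""
    unique = sorted(set(links))
    by_type = Counter(extension(u) for u in unique)
    images = [u for u in unique if extension(u) in IMAGE_EXTS]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = rng.sample(images, min(sample_size, len(images)))
    return unique, by_type, sample
```

The ratio `len(unique) / len(links)` gives the proportion of unique links that is compared between the home page and inner page collections.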

5.2 Image Processing and Analysis 

Visual Features. We extracted all 72 visual features mentioned in Section 4.3 from 
the 200,000 images that were processed. It was found that 19.2% of the images 
correspond to photographs and the rest to graphics. Interestingly, the 
number of images with a given area follows a power-law distribution. The analysis 
was split between home page images (HP-images) and inner page images 
(IP-images). 

Skin Detection. Skin was detected in 6.5% of the HP-images and in 7.9% of 
the IP-images. The reason for this difference seems to be the larger size of the 
IP-images and the larger proportion of photographs in this set. The average size of the 
skin clusters is 3167/3121 pixels, and the mean number of skin clusters in each image 
is 3.73/4.14, for the HP- and IP-images, respectively. 

Face Detection. We found that 2.07% of the HP-images and 2.12% of the 
IP-images contained faces. The average number of faces per image (among the images 
containing faces) is 2.1167/2.1162 for the HP- and IP-images, respectively. The 
maximum number of faces found in a single image was 89/39 for the HP-/IP-images¹. 
It was also found that the distribution of the number of faces in both image sets 
(considering only the images that contain faces) is close to a power law. For example, 
for HP-images, considering cases from 2 to 10 faces, the parameter of the distribution 
is -2.13. 
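
One common way to estimate such a parameter is a least-squares line fit in log-log space. The sketch below shows that procedure only as an assumption, since the fitting method used for the reported value is not stated; the data in the usage note is synthetic and not taken from the study.

```python
import math


def power_law_exponent(counts):
    """counts: {number_of_faces: number_of_images}.

    Fits log(count) = a + b * log(n) by ordinary least squares and returns b.
    """
    xs = [math.log(n) for n in counts]
    ys = [math.log(c) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx
```

For example, synthetic counts generated as `{n: 1000.0 * n ** -2.13 for n in range(2, 11)}` lie exactly on a power law, and the function recovers an exponent of -2.13 up to rounding error.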



5.3 Text Analysis 

We consider two sets of 1,965 images each, corresponding to images that appear in web 
pages. The first set contains images with a high probability of showing portions of human 
skin, and the second set contains images with human faces. The face images belong to 
1,480 web pages, while the skin images belong to only 748 different web pages. 
These images were obtained using the algorithms already described. We model the 
associated web pages as term-weight vectors using a vocabulary of 20,600 words. 
Stopwords were eliminated from this vocabulary. The cosine similarity function 
between vectors was employed. 

The clustering process is guided by a score function that measures the overall 
quality of the clusters. The score function used is the total sum of the average similarities 
between the vectors and the centroids of the clusters to which they are assigned. Each run of 
the algorithm computes k clusters. Thus, in order to find adequate values of k, we ran 
the algorithm several times. We reached a quality score of 0.80, which reflects high 
similarity between the objects in each cluster, at k=250 and k=300 for the home-skin 
and home-face datasets, respectively. In [15] we show figures that depict the 
quality of the clusters found for different values of k, for the face and skin web page 
sets. These figures also show curves with the incremental gain in the overall quality 
of the clusters, and a histogram of the number of clusters per average intra-cluster 
similarity for the two datasets at the aforementioned values of k. Many clusters that 
represent clearly defined contexts for images were found. In [15] we also present 
tables with some of the clusters found. These clusters allowed us to discover semantic 
connections between web pages having faces. An example is a cluster 
containing web pages related to movies, musicals, DVDs, etc. Also, good search 
keywords can be detected using the word frequencies of each cluster. 

5.4 Processing Time 

Page Gathering. It took about 8 hours to gather the 400,000 pages using a single, standard 
PC running Linux. With this setting, gathering the pages of the whole Chilean web 
takes about five days. Feature Extraction. The automatic extraction of the 
72 visual features from the 200,000 images under analysis takes about 47 hours on a 
single, standard PC running Linux. Skin Segmentation and Face Detection. 
Skin segmentation and face detection on the 200,000 images took about 10 
hours and 40 hours, respectively, using a single, standard PC running Linux. Webpage 
Clustering. The text of 2228 web pages was clustered. It took 5 minutes to compute 
300 clusters using a standard PC running Windows XP. Obviously, any of these 
processes can be sped up by using more than one PC, and can be done in a reasonable 
time, as they are off-line tasks. 

¹ A group photo in www.bradford.cl has 89 faces! 



6 Conclusions 

We investigated the content-based retrieval of images in specific web collections and 
also the characterization of the visual contents of these collections. To that end we 
presented tools for efficient web-crawling, content-based image analysis (low-level 
features such as color, shape and texture), skin segmentation, face detection and 
clustering of web pages using text information. For developing and testing these tools we 
analyzed more than 4 million web pages, processed more than 383,000 images and 
clustered the text of more than 2,200 web pages. A first application of these tools is 
the characterization of the image contents of the .CL domain. For carrying out this 
study a statistically representative subset of the total number of images of the .CL 
domain was employed. In the final version of this article we plan to also include 
results on clustering a larger sample of only the text segments surrounding the 
selected images. 

In this article we also presented a strategy for the content-based search of persons 
using visual and text information. All relevant components of this system, 
including a face recognition subsystem, are already built. We are working on the 
final integration of the system, which will be reported in the near future. 

Abstract. An approach for classification of building images through 
rule-based fuzzy inference is presented. It exploits rough matching and 
problem domain knowledge to improve precision results. This approach 
uses knowledge representation based on a fuzzy reasoning model for establishing 
a bridge between visual primitives and their interpretations. 
Knowledge representation goes from low level to high level features. The 
knowledge is acquired from both visual content and users. These users 
provide the interpretations of low level features as well as their knowledge 
and experience to improve the rule base. 

Experiments are tailored to building image classification. This approach 
can be extended to other semantic categories, e.g. skyline, vegetation, 
landscapes. Results show that the proposed method is a promising support for 
semantic annotation of image/video content. 



1 Introduction 

The evolution of multimedia applications from the recent past to the near future 
goes from automatic signal extraction to user-provided knowledge, passing 
through features and semantics. A critical paradigm that has captured the attention 
of researchers is “bridging the gap”, i.e. from features to semantics [1]. 
This problem can be stated as “to find a technique for automatic recognition of 
the underlying semantic structure of given features”. 

In general terms, annotation is a process to represent features by symbols. In 
the context of semantic annotation these symbols represent user interpretations 
of the features. The process of assigning a feature vector to a specific symbol 
is a classification task. Thus, the annotation process is decomposed into feature 
extraction followed by classification. 

In this work, an approach for semantic classification of building images 
through rule-based fuzzy inference is presented. It uses knowledge representa- 
tion based on a fuzzy reasoning model for establishing a bridge between visual 
primitives and their interpretations. 

* The research leading to this paper was done within the framework of the Network 
of Excellence SCHEMA. 

This approach exploits rough matching, problem domain knowledge and the 
effect of weighted rules in fuzzy rule-based classification to improve precision 
results. Consequently, a high classification performance can be achieved with a 
relatively simple and consistent model that follows the way human reasoning 
works. A detailed analysis of rule weight effects in classification systems is 
presented in [2]. 

The approach is tailored to annotate building images. Existing approaches 
for classification of building images use a Bayesian framework to exploit image 
features by perceptual grouping [3], binary Bayesian hierarchical classifiers [4], 
or perform building semantic extraction using support vector machines [5]. 

In the following, Section 2 defines the classification problem. Section 3 then 
presents the approach. In Section 4 knowledge representation for building image 
classification is described. Section 5 gives a summary of experimental results. 
Finally, Section 6 concludes the paper. 

2 Problem Definition 

This work addresses the problem of classification of low level features into high 
level concepts as a first step towards semantic image/video annotation. 

Let $x = (\ldots, x_{1n}, x_{2n}, \ldots, x_{Mn})$ be an image, $f = \{f^{(1)}, \ldots\}$ 
be feature sets where each $f^{(i)}$ is a function of the image $x$, $E^v = (e^v_1, \ldots, e^v_L)$ be a 
pattern extracted from a feature vector $f^{(i)}$, and $Y = \{y_1, \ldots, y_K\}$ be a class 
set. The classification problem is stated as: Learn a function 

$$g(x) : E^v \to Y, \tag{1}$$

where $y_j$ are symbols identifying categories and representing semantic 
interpretations of pattern $E^v$. 

$g(x)$ can be decomposed into $K$ single-class specialised classifiers 

$$g_j(x) : E^v \to y_j, \quad 1 \le j \le K. \tag{2}$$

Subsequently, a fuzzy model is extracted from the feature set in order to 
approximate each function $g_j(x)$ by a set $R_j = \{R_{j1}, \ldots, R_{jC}\}$ of $C$ if-then 
rules. 

Therefore, $g(x)$ is summarised by a rule base $R$ as follows: 

$$g(x) \approx R = \bigvee_{j=1,\,k=1}^{K,\,C} w R_{jk}, \tag{3}$$

where $w \in [0, 1]$ is the weight of rule $R_{jk}$. 

3 A Semantic Annotation Process 

Based on [6], the proposed image annotation process, denoted by $L$, can be 
summarised as 

$$\{V^0, x_i, Y\} \; L \; \{V^n, Z_i\}, \tag{4}$$

where $V^0$ is an abstract non-annotated image space defined as a tuple $\{X, Y\}$, 
$X$ is a set of images, $Y$ is a set of symbols, and $x_i \in X$ is a selected image from 
a video unit to be annotated. 

The process $L$ begins by creating an instance of the abstract non-annotated 
image space $V^0 = \emptyset$. Afterwards, the annotation operator $l \in L$ maps $V^0$ into the 
annotated image space in an interactive annotation session, which is a sequence 
of annotation spaces $V^0, \ldots, V^n$ with $V^n(x_i) = l(V^n)$. 

The result is a set of labels $Z_i = \{z_{i1}, \ldots, z_{im}\}$ which represents annotations 
spatially and temporally linked to descriptive concepts of the video unit content. 
Each label has the form 

$$z_{ij} : x_i \mapsto y_j, \tag{5}$$

where $g(x_i) = z_{ij}$ and $y_j$ is a symbol labeling descriptions of image $x_i$. 

These descriptions can be either real objects in the image (e.g. building, car, 
person, etc.) or abstract concepts describing what is happening in the image 
(e.g. a person is driving a car). On the one hand, classification of real objects 
is useful for annotation of video units by applying image understanding. On the 
other hand, classification of abstract concepts is closer to video understanding. 

Fig. 1 gives an overview of the process. First, a pre-processing 
step performs filtering and selection of suitable low level features from a training 
set $X$ in order to facilitate pattern extraction. A pre-processing procedure on 
sequences of images extracted from videos for improving building image 
classification is presented in [7]. 




Fig. 1. Data flow of the approach for semantic annotation 



Afterwards, a rule-based fuzzy model is extracted from the feature set to 
associate low level features with each specific class represented by a particular 
symbol $y_j$. This step corresponds to the training of single-class classifiers. In 
addition to the information extracted from sample images, the fuzzy model is 
improved with problem domain knowledge provided by a user. 

The result is a classification model based on fuzzy reasoning and used 
for establishing a bridge between visual primitives and their interpretations [8]. 
Finally, classification results are used by process $L$ as annotations of image $x_i$ 
with symbol $y_j$. 

4 Building Image Classification 

This approach is tested using a standard visual descriptor whose main 
characteristic is the simplicity with which it represents the local and global 
distribution of edges without expensive computation. It uses luminance mean values 
to detect edges, which is considered a weak approach. However, the weakness of the 
low-level description is compensated by contributions of problem domain knowledge 
given by a user. Contributions can be made on-line. Thus, fine-tuning is possible 
through relevance feedback. 

Section 4.1 presents the semantics of the descriptor used to extract salient 
features from building images and Section 4.2 gives details of the classification model. 



4.1 Feature Space 

The feature space is built with feature vectors extracted from the image database 
using the MPEG-7 edge histogram descriptor [9]. This visual descriptor uses 16 
histograms to represent the local distribution of directional edges within an image. 
These histograms consist of bins associated with five edge categories, namely 
horizontal, vertical, 45°, 135°, and non-directional edges. 

The image is spatially decomposed into 16 sub-images using a fixed grid 
with equal-size rectangles. Each sub-image is divided into a given number of 
non-overlapping small square blocks (image-blocks). Consequently, the block 
size depends on the sub-image size. The basic components of edge histogram 
descriptor are shown in Fig. 2. 

Blocks are also divided into 4 sub-blocks. The luminance mean values in the 
gray scale are measured in order to determine the sub-block pixel intensity. Then, 
blocks are passed through five masks to assign the corresponding edge category. 

The number of blocks per edge category is counted to compute the edge 
distribution within a sub-image. The bins in the histograms summarise the 
distribution of each edge category over the 16 sub-images in a left-right and top-down 
scanning. 

Fig. 2. Basic components of edge histogram descriptor 

The semantics of the edge histogram descriptor can be summarised as 

$$\mathrm{ehd}(x) = \{h^1_1, h^1_2, \ldots, h^i_j, \ldots\}, \tag{6}$$

where $x$ is an image and $h^i_j$ is the number of edges of category $j$ in the $i$-th sub-image. 
Therefore, the feature space consists of feature vectors with 80 dimensions. 
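
A simplified version of this descriptor can be sketched as follows. It is an illustration under several assumptions rather than a conforming MPEG-7 implementation: the filter coefficients and the edge threshold are the values commonly quoted for this descriptor and are not taken from this paper, the block size is fixed instead of being derived from the sub-image size, and the bins are normalized by the number of blocks.

```python
import math

# Filter coefficients applied to the four sub-block means, ordered
# top-left, top-right, bottom-left, bottom-right (assumed values).
FILTERS = {
    "vertical":        (1, -1, 1, -1),
    "horizontal":      (1, 1, -1, -1),
    "diagonal_45":     (math.sqrt(2), 0, 0, -math.sqrt(2)),
    "diagonal_135":    (0, math.sqrt(2), -math.sqrt(2), 0),
    "non_directional": (2, -2, -2, 2),
}
CATEGORIES = list(FILTERS)


def mean(img, top, left, size):
    """Mean luminance of a size x size square."""
    total = sum(img[y][x] for y in range(top, top + size)
                for x in range(left, left + size))
    return total / (size * size)


def block_category(img, top, left, block, threshold):
    """Index of the strongest edge category, or None for a block without edge."""
    half = block // 2
    a = [mean(img, top + dy, left + dx, half) for dy in (0, half) for dx in (0, half)]
    strengths = [abs(sum(c * m for c, m in zip(FILTERS[name], a)))
                 for name in CATEGORIES]
    best = max(range(len(CATEGORIES)), key=strengths.__getitem__)
    return best if strengths[best] > threshold else None


def edge_histogram(img, block=4, threshold=11.0):
    """img: list of rows of luminance values. Returns 16 x 5 = 80 bins."""
    sub_h, sub_w = len(img) // 4, len(img[0]) // 4
    descriptor = []
    for gy in range(4):                  # sub-images scanned top-down ...
        for gx in range(4):              # ... and left to right
            counts, n_blocks = [0] * len(CATEGORIES), 0
            for top in range(gy * sub_h, (gy + 1) * sub_h - block + 1, block):
                for left in range(gx * sub_w, (gx + 1) * sub_w - block + 1, block):
                    n_blocks += 1
                    cat = block_category(img, top, left, block, threshold)
                    if cat is not None:
                        counts[cat] += 1
            descriptor.extend(c / n_blocks if n_blocks else 0.0 for c in counts)
    return descriptor
```

Bin `5 * i + j` of the result holds the share of blocks of category `j` in sub-image `i`, matching the index convention of Eq. 6.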

4.2 Classification Model 

According to the semantics of the edge histogram descriptor (Eq. 6), five input 
variables are defined for the fuzzy model. Each input variable corresponds to an 
edge type category. 

A compact fuzzy model is proposed, associating a group of fuzzy sets with 
each input variable. In order to simplify the model, three fuzzy sets are used 
by default. Linguistic hedges are not applied in this model. Three standard 
piecewise linear functions are used to determine memberships of these fuzzy 
sets: Z-shape, Λ-shape, and S-shape. These functions are adjusted to specific 
intervals on the x axis called boundaries. Intervals are computed using a clustering 
technique, i.e. fuzzy c-means. Tab. 1 presents a summary of the fuzzy variables 
with their fuzzy sets and boundaries. 
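
The three piecewise linear membership functions can be written down as in the sketch below. How the boundaries parameterize each function is not spelled out in the text, so the pairing of Low, Medium and High with the Z-, Λ- and S-shapes and the boundary values in the usage note are assumptions chosen for illustration.

```python
def z_shape(x, a, b):
    """Membership 1 up to a, falling linearly to 0 at b."""
    if x <= a:
        return 1.0
    if x >= b:
        return 0.0
    return (b - x) / (b - a)


def s_shape(x, a, b):
    """Mirror image of the Z-shape: 0 up to a, rising linearly to 1 at b."""
    return 1.0 - z_shape(x, a, b)


def lambda_shape(x, a, b, c):
    """Triangle: 0 at a, peak of 1 at b, back to 0 at c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def fuzzify(x, low, medium, high):
    """Degrees of membership of a normalized edge count x in the three sets."""
    return {
        "Low": z_shape(x, *low),
        "Medium": lambda_shape(x, *medium),
        "High": s_shape(x, *high),
    }
```

For instance, `fuzzify(0.2, low=(0.1, 0.3), medium=(0.1, 0.25, 0.4), high=(0.3, 0.4))` rates the value 0.2 as partly Low and partly Medium but not High.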



Table 1. Fuzzy variables used for the Edge Histogram Descriptor. Each fuzzy set, e.g. 
Low, Medium, or High, is defined by a membership function. These functions have a 
domain within the interval determined by the corresponding boundaries b1, b2, and 
b3. Intervals are automatically generated using the Fuzzy C-Means algorithm 

Universe (Variables): Vertical Edge, Horizontal Edge, Diagonal 45°, Diagonal 135°, 
Non Directional 

Fuzzy Set    b1     b2     b3 
Low          0.1    0.3    1.0 
Medium       0.1    0.3    0.4 
High         0.0    0.3    0.4 


The classification model combines input variables with labels through fuzzy 
inference rules, which are used to classify new images into the defined semantic 
categories. These inference rules are organised into a multi-input single-output 
rule space, as shown in Fig. 3. 

Each inference rule has an IF-part with five antecedents and a THEN-part 
with one consequent as follows: 

IF $e_1$ is $A_1$ ... AND $e_5$ is $A_5$ THEN classify as $y_j$ (7) 

where $e_i$ is an instance of an input variable (edge type), $A_i$ is a linguistic term 
(fuzzy set name) used to transform values from a continuous to a discrete domain, 
and $y_j$ is a symbol labeling a semantic category. The classifier uses the rule base for 
labeling features as either “Non Building” or “Building”. 

Fig. 3. Rule base. Fuzzy variables associated with each type of edge are combined in a 
multi-input single-output space. $R_j$ is an if-then rule as shown in Eq. 7. Fuzzy sets 
included in rule $R_j$ are indicated with highlighted functions 
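
A weighted rule base of this form can be evaluated as in the following sketch. The choice of operators is an assumption: the minimum is used for AND, the rule weight scales the firing strength, and the maximum over the rules of a class plays the role of the disjunction in Eq. 3. The data layout and function names are hypothetical.

```python
def rule_strength(memberships, antecedents, weight=1.0):
    """Firing strength of one rule.

    memberships: {variable: {fuzzy_set_name: degree}}
    antecedents: {variable: fuzzy_set_name}, one entry per IF-part term
    """
    firing = min(memberships[var][term] for var, term in antecedents.items())
    return weight * firing


def classify(memberships, rules):
    """rules: list of (antecedents, label, weight). Returns (label, scores)."""
    scores = {}
    for antecedents, label, weight in rules:
        strength = rule_strength(memberships, antecedents, weight)
        scores[label] = max(scores.get(label, 0.0), strength)
    return max(scores, key=scores.get), scores
```

Setting the weight of a rule to 0 disables it and a weight of 1 leaves it fully active, which is one way a user could tune the rule base.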



5 Experimental Results 

The classification approach is tailored to the semantic category “Building”. 
A subjective selection of the characteristics required to classify an image into this 
category is performed. 

An image is categorised as “Building” when a building structure of a visible 
size appears in the scene. One difficulty with this type of image is the variety of 
kinds of buildings, e.g. castles, houses, warehouses, religious facilities, etc. 
Fig. 4 shows some randomly selected samples of building images. 




Fig. 4. Samples of building images 



Experiments were conducted using a testing set with over 3000 distinct images 
extracted from the TRECVID [10] video repository. For training, 15 and 100 
building images were selected from TRECVID and Corel Draw Gallery, respectively. 
A set of images (key-frames) extracted from videos was automatically 
classified using the proposed approach. 

The ground truth for all images was assigned by a single subject. Classifier 
performance evaluation is based on the number of images correctly classified 
and the number of misclassified images. The classification results are given in 
Tab. 2. 

A first run (test 1) was performed to evaluate classification results after 
applying local analysis. As the edge histogram descriptor decomposes the original image 
into 16 sub-images, local analysis means classification of each single sub-image. 
Salient edges found in these sub-images and satisfying the criteria of the inference rules 
are classified as “Building”. This kind of analysis allows detection of parts of a 
building structure within the scene. The classification result is over 70%. 

A second run (test 2) was performed to evaluate classification results after 
applying global analysis. This means that an image is classified as “Building” when 
the number of sub-images classified as “Building” is greater than a threshold. 
This kind of analysis reduces the misclassification introduced by sub-images 
that satisfy the conditions but are not part of a building structure. 

A third run (test 3) was used to evaluate classification results after varying 
the rule weights. These weights are real values in the range [0, 1], as indicated 
in Eq. 3. Different weights are assigned to each rule according to 
its relevance for determining the pattern matching of a sub-image. These weights 
are modified by the user (problem domain knowledge). Experiments show that this 
procedure improves classification results. 
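
The step from local to global analysis amounts to a vote over the 16 sub-image decisions, which can be sketched as follows. The threshold is a free parameter here, because its value is not given in the text, and the function name is hypothetical.

```python
def global_decision(sub_image_labels, threshold):
    """Label an image "Building" if more than `threshold` sub-images are."""
    hits = sum(1 for label in sub_image_labels if label == "Building")
    return "Building" if hits > threshold else "Non Building"
```

With 5 of 16 sub-images labeled “Building”, the image is accepted for a threshold of 4 and rejected for a threshold of 5.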



Table 2. Classification results [%] 

Test    Correctly Classified    Misclassified 
1       70.05                   29.27 
2       83.95                   14.51 
3       86.31                   13.29 




6 Conclusions and Further Work 

An approach for building image classification that utilises a fuzzy reasoning 
model is presented. It decomposes the process in feature extraction and classifi- 
cation. Single-class classifiers are used to assign images into semantic categories 
defined by the user. 

This approach exploits rough matching and problem domain knowledge to 
get high classification performance even with few sample images. Combining 
few training images with contributions provided by a user 86% of precision was 
obtained classifying over 3000 images extracted from real-world videos. 
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Experiments are tailored to building image classification. However, this ap- 
proach can be extended to other semantic categories, i.e. skyline, vegetation, 
landscapes, etc. 

This approach was tested using a standard visual descriptor which main 
characteristic is the simplicity to represent local and global distribution of edges 
without expensive computation. Weakness of low-level descriptions is compen- 
sated with contributions of problem domain knowledge give by a user. 

The fuzzy reasoning model facilitates identification of the rules that participate 
in a specific classification result. Rule weights can be used either to enable or to 
disable specific rules. These tuning actions can be performed on-line. 

Further work will address extracting the classification model from an image 
database using unsupervised learning. Afterwards, the model would be tuned using 
relevance feedback. 

Abstract. In this paper, we present an approach for the retrieval of natural scenes 
based on a semantic modeling step. Semantic modeling stands for the classification 
of local image regions into semantic classes such as grass, rocks or foliage and the 
subsequent summary of this information in so-called concept-occurrence vectors. 
Using this semantic representation, images from the scene categories coasts, 
rivers/lakes, forests, plains, mountains and sky/clouds are retrieved. 
We compare two implementations of the method quantitatively on a visually 
diverse database of natural scenes. In addition, the semantic modeling approach is 
compared to retrieval based on low-level features computed directly on the 
image. The experiments show that semantic modeling does in fact lead to better 
retrieval performance. 



1 Introduction 

Semantic understanding of images remains an important research challenge for the image 
and video retrieval community. Some even argue that there is an “urgent need” to gain 
access to the content of still images [1]. The reason is that techniques for organizing, 
indexing and retrieving digital image data are lagging behind the exponential growth 
of the amount of this data (for a review see [2]). Natural scene categorization is an 
intermediate step to close the semantic gap between the image understanding of the 
user and the computer. In this context, scene categorization refers to the task of grouping 
arbitrary images into semantic categories such as mountains or coasts. 

First steps in scene category retrieval were made by Gorkani and Picard [3] (city 
vs. landscape), Szummer and Picard [4] (indoor/outdoor) and Vailaya et al. [5] 
(indoor/outdoor, city/landscape, sunset/mountain/forest). All these approaches have in 
common that they only use global information rather than local information. More recent 
approaches try to automatically annotate local semantic regions in images [6]-[9] but 
the majority does not attach a global label to the retrieved images. Oliva and Torralba 
find global descriptions for images based on local and global features but without an 
intermediate annotation step [10]. 

The general goal of our work is to find semantic models of outdoor scenes. In the 
context of image retrieval, such models reduce the number of potentially relevant images. 
They also allow an adaptive search for semantic image content inside a particular category 
(e.g. an image from the mountains-category, but with large forest, no rocks). Thus a 
bottom-up step, i.e. scene categorization, and a top-down step, i.e. the use of category 
information to model relevant images in more detail, can be combined. For that goal, 
we employ a semantic modeling step. Semantic modeling stands for the classification of 
image regions into concept classes such as rocks, water or sand and the scene retrieval 
based on this information. The advantage of an intermediate semantic modeling step 
is that the system can easily be extended to more categories. Also, for local semantic 
concepts, it is much easier to obtain ground-truth than for entire images, which are often 
ambiguous. In this paper, we compare two implementations of the semantic modeling 
approach for natural scene retrieval. In addition, we evaluate how the semantic modeling 
approach compares with direct low-level feature extraction. 

P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 207-215, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 

J. Vogel and B. Schiele 

[Figure: example images, one panel per scene category; legible panel labels: mountains, (f) sky/clouds] 

Fig. 1. Exemplary images for each category. 

Concerning the database, we paid special attention to using highly varying scenes. 
The database contains hardly any two visually similar images. All experiments have been 
fully cross-validated in order to average out the fact that in such diverse databases certain 
test sets perform better than others. The goal is to find out how much benefit semantic 
modeling brings in a realistic setting. 

The paper is organized as follows. In the next section, our scene database and the 
image representation are discussed in detail. Section 3 explains the interplay between the 
semantic modeling step and the retrieval stage. Finally, Section 4 is devoted to several 
experiments that compare two different implementations of the system and quantify the 
performance of the semantic modeling approach vs. a low-level feature-based approach. 



2 Natural Scene Categories 

For the scene retrieval, we selected six natural scene categories: coasts, forests, 
rivers/lakes, plains, mountains and sky/clouds. Exemplary images for each 
category are displayed in Figure 1. The selected categories are an extension of the 
natural basic level categories of Tversky and Hemenway [11]. In addition, the choice of 
suitable categories has been influenced by the work of Rogowitz et al. [12]. 

Concept    Occurrences
sky        14.0%
water      32.5%
grass       0.0%
trunks      0.0%
foliage     6.5%
fields      0.0%
rocks      31.0%
flowers     0.0%
sand       16.0%

Fig. 2. Semantic modeling 



Obviously, these scene categories are visually very diverse. Even for humans the 
labeling task is non-trivial. Nonetheless, pictures of the same category share common 
local content, such as the local semantic concepts rocks or foliage. For example, 
pictures in the plains-category contain mainly grass, field and flowers, whereas 
mountains-pictures contain much foliage and rocks, but also grass. Based on this 
observation, our approach to scene retrieval is to use this local semantic information. 

2.1 Concept Occurrence Vectors 

By analyzing the local similarities and dissimilarities of the scene categories, we 
identified nine discriminant local semantic concepts: sky, water, grass, trunks, foliage, field, 
rocks, flowers and sand. In order to avoid a potentially faulty segmentation step, the scene 
images were divided into an even grid of 10x10 local regions, each comprising 1% of the 
image area. Through so-called concept classifiers, each local region is classified into one 
of the nine concept classes. Each image is represented by a concept occurrence vector 
(COV) which tabulates the frequency of occurrence of each local semantic concept (see 
Figure 2). A more detailed image representation can be achieved if multiple COVs are 
determined on non-overlapping image areas (e.g. top/middle/bottom) and concatenated. 
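
The computation of a COV can be illustrated with a minimal sketch. It assumes that the region labels of the 10x10 grid are already available (from annotation or from a concept classifier); the function names and the exact row boundaries of the top/middle/bottom areas are illustrative assumptions and are not specified in the text.

```python
from collections import Counter

CONCEPTS = ["sky", "water", "grass", "trunks", "foliage",
            "field", "rocks", "flowers", "sand"]

def concept_occurrence_vector(region_labels):
    """Relative frequency of each concept among the given region labels."""
    counts = Counter(region_labels)
    total = len(region_labels)
    return [counts[c] / total for c in CONCEPTS]

def multi_area_cov(label_grid, row_bands):
    """Concatenate one COV per horizontal band of a grid of region labels.

    label_grid: list of rows, each a list of concept names (e.g. 10 x 10).
    row_bands:  list of (start_row, end_row) pairs, end exclusive
                (hypothetical band boundaries, chosen for illustration).
    """
    cov = []
    for start, end in row_bands:
        labels = [lab for row in label_grid[start:end] for lab in row]
        cov.extend(concept_occurrence_vector(labels))
    return cov

# Toy 10x10 grid: 3 rows of sky, 3 rows of rocks, 4 rows of water
grid = [["sky"] * 10] * 3 + [["rocks"] * 10] * 3 + [["water"] * 10] * 4
global_cov = concept_occurrence_vector([lab for row in grid for lab in row])
three_area_cov = multi_area_cov(grid, [(0, 3), (3, 7), (7, 10)])
```

With nine concepts, one global COV has 9 entries and the three-area variant has 27, which matches the vector lengths used later for retrieval.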

2.2 Database 

Our database consists of 700 natural scenes: 143 coasts, 114 rivers/lakes, 103 
forests, 128 plains, 178 mountains and 34 sky/clouds. Images are present both 
in landscape and in portrait format. In order to obtain ground-truth for the concept 
classifications, all 70’000 local regions (700 images * 100 subregions) have been annotated 
manually with the above-mentioned semantic concepts. Again, a realistic setting was 
of prime interest. For that reason, each annotated local region was allowed to contain a 
small amount (at most 25%) of a second concept. Imagine a branch that looms into 
the sky but does not fill a full subregion (sky with some trunks), or a lake that borders on 
the forest (water with foliage). Due to these quantization issues, only 59’582 of the 
70’000 annotated regions can be used for the concept classifier training, since 
only those contain a single concept to at least 75%. The rest has been annotated 
doubly. As some concepts exist in nearly all images and some only in a few images, the 
size of the nine classes varies between 1’516 (trunks) and 15’405 (sky) regions. 

Table 1. Confusion matrix of the local concept classification (k-NN classifier) 

                                Classifications in %
           sky  water  grass  trunks  foliage  field  rocks  flowers  sand  #regions
sky       91.8    5.7    0.0     0.1      0.5    0.2    1.6      0.0   0.2     15360
water      9.5   68.1    2.4     0.0      6.0    3.8    9.0      0.1   1.2      7309
grass      0.9    6.4   34.4     0.5     43.1    9.0    4.5      0.9   0.5      3541
trunks     0.8    0.8    1.5    28.0     45.6    5.9   16.3      1.1   0.0      1516
foliage    0.5    1.0    2.5     1.0     85.1    1.2    7.3      1.4   0.0     13470
field      1.2    7.4    6.4     1.3     18.8   34.8   27.4      1.8   0.9      4070
rocks      1.7    3.5    0.7     1.0     24.6    6.6   61.0      0.4   0.6     10567
flowers    0.9    0.7    2.2     0.3     53.0    2.4    4.7     35.5   0.4      2051
sand       6.3   19.7    6.3     0.4      2.2   16.5   32.6      0.3  16.8      1773

3 Two-Stage Scene Retrieval 

In order to implement the semantic modeling step, the natural scene retrieval proceeds 
in two stages. In the first stage, the local image regions are classified into one of the nine 
concept classes. In the second stage, the concept occurrence vector is determined and the 
images are retrieved based on that concept occurrence vector. The following describes 
those two stages in more detail. 

3.1 Stage I: Concept Classification 

The local image regions are represented by a combination of a color and a texture feature. 
The color feature is an 84-bin HSI color histogram (H=36 bins, S=32 bins, I=16 bins), 
and the texture feature is a 72-bin edge-direction histogram. Tests with other features, 
such as RGB color histograms, texture features of the gray-level co-occurrence matrix, 
or FFT texture features, resulted in lower classification performance. The classification 
has been tested with both a k-Nearest-Neighbor (k-NN, k = 30) and a Support Vector 
Machine (SVM, C = 8, γ = 0.5) [13] concept classifier. 
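
To make the composition of the 84-bin color feature concrete, the following sketch concatenates separate H, S and I histograms (36 + 32 + 16 bins). It is only an illustration under stated assumptions: the RGB-to-HSI conversion is the common textbook formula, and the per-channel normalization is our choice; neither is specified in the text. The 72-bin edge-direction histogram is omitted.

```python
import math

H_BINS, S_BINS, I_BINS = 36, 32, 16  # 36 + 32 + 16 = 84 bins

def rgb_to_hsi(r, g, b):
    """Convert RGB in [0, 1] to (H in [0, 360), S in [0, 1], I in [0, 1])."""
    i = (r + g + b) / 3.0
    s = 0.0 if i == 0 else 1.0 - min(r, g, b) / i
    num = 0.5 * ((r - g) + (r - b))
    den = math.sqrt(max(0.0, (r - g) ** 2 + (r - b) * (g - b)))
    if den == 0:
        h = 0.0  # hue is undefined for gray pixels; use 0 by convention
    else:
        theta = math.degrees(math.acos(max(-1.0, min(1.0, num / den))))
        h = theta if b <= g else 360.0 - theta
    return h % 360.0, s, i

def bin_index(value, upper, n_bins):
    """Map a value in [0, upper] to a bin index in [0, n_bins - 1]."""
    return min(int(value / upper * n_bins), n_bins - 1)

def hsi_histogram(pixels):
    """84-bin feature: H, S and I histograms concatenated, each summing to 1."""
    h_hist = [0.0] * H_BINS
    s_hist = [0.0] * S_BINS
    i_hist = [0.0] * I_BINS
    for r, g, b in pixels:
        h, s, i = rgb_to_hsi(r, g, b)
        h_hist[bin_index(h, 360.0, H_BINS)] += 1
        s_hist[bin_index(s, 1.0, S_BINS)] += 1
        i_hist[bin_index(i, 1.0, I_BINS)] += 1
    n = float(len(pixels))
    return [v / n for v in h_hist + s_hist + i_hist]

# A toy "region" consisting of four pure-red pixels
hist = hsi_histogram([(1.0, 0.0, 0.0)] * 4)
```

Appending a 72-bin texture histogram to such a vector would give the 156-dimensional region descriptor.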

With a classification rate of 68.9%, the k-NN concept classifier performed slightly worse 
than the SVM concept classifier with 69.9%. Nevertheless, its classifications perform 
better in the subsequent retrieval stage, and it will therefore be employed in all following 
experiments. The reason for this behavior is that the global classification rate usually 
improves to the benefit of the large classes (sky, foliage) and at the expense of the 
smaller classes (field, flowers, sand). Since these smaller classes are essential for scene 
retrieval, the overall classification accuracy of the first stage is not the most important 
performance measure. 

The experiments have been performed with 10-fold cross-validation on image level. 
That is, regions from the same image are either in the test or in the training set but never 
in different sets. This is important since image regions from the same semantic concept 
tend to be more similar to other (for example neighboring) regions in the same image 
than to regions in other images. The confusion matrix of the experiments with the k-NN 
concept classifier is depicted in Table 1. The confusion matrix shows a strong correlation 
between class size and classification result. In addition, we observe confusions between 
similar semantic classes, such as grass and foliage, trunks and foliage, or field and rocks. 
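
Cross-validation on image level can be sketched as follows. The folds are assigned per image, and every region inherits the fold of its image, so that regions of one image never end up on both sides of a split. The function names and the random assignment are illustrative assumptions, not the authors' procedure.

```python
import random

def image_level_folds(image_ids, n_folds=10, seed=0):
    """Assign each image (not each region) to one of n_folds folds."""
    unique_ids = sorted(set(image_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    return {img: i % n_folds for i, img in enumerate(unique_ids)}

def split_regions(region_image_ids, fold_of_image, test_fold):
    """Return (train_indices, test_indices) over regions for one fold."""
    train, test = [], []
    for idx, img in enumerate(region_image_ids):
        (test if fold_of_image[img] == test_fold else train).append(idx)
    return train, test

# Toy example: 20 images with 100 regions each
region_image_ids = [img for img in range(20) for _ in range(100)]
fold_of_image = image_level_folds(region_image_ids, n_folds=10)
train_idx, test_idx = split_regions(region_image_ids, fold_of_image, test_fold=0)
```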

The trained concept classifier is used to classify all regions of an image into one of 
the semantic classes. Experience showed that doubly annotated regions (e.g. with 
sky and rocks at the border between the sky and a mountain) were usually classified as 
one of those two semantic concepts. 



3.2 Stage II: Scene Retrieval Based on Concept Occurrence Vectors 

The output of the first stage is localized semantic information about the image. It specifies 
where in the image there are e.g. sky or foliage-regions and how much of the image is 
covered with e.g. water. From that semantic information, the concept occurrence vectors 
are determined. Experiments have shown that the retrieval performance improves if 
multiple concept occurrence vectors are computed on either three (top/middle/bottom) or 
five image areas. This leads to concatenated concept occurrence vectors of length 27 
or 45, respectively. 

In the following we propose two different implementations to semantically categorize 
images based on the concept occurrence vectors, namely a Prototype approach and 
an SVM approach. In the experiments those two implementations are compared and 
analyzed. 



Prototype approach to scene retrieval. The prototype for a category is the mean over 
all concept occurrence vectors of the respective category members. Thus, the prototype 
can be seen as the most typical image representation for a particular scene category, where 
the respective image does not necessarily exist. The bins or attributes of the prototype 
indicate how much of a certain concept an image of a particular scene 
category typically contains. For example, a forest-image usually does not contain any 
sand. Therefore, the “sand-bin” of the forest-prototype is close to zero. 

When determining the category of an unseen image, the Euclidean or the Mahalanobis 
distance between the image’s concept occurrence vector and the prototype is computed. 
The smaller the distance, the more likely it is that the image belongs to the respective 
category. By varying the accepted distance to the prototype, precision and recall for the 
retrieval of a particular scene category can be influenced. 
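
The prototype-based retrieval can be illustrated with a short sketch using the Euclidean distance only. The three-concept toy vectors, the function names and the threshold values are made up for illustration; the Mahalanobis variant is not shown.

```python
import math

def prototype(covs):
    """Mean concept occurrence vector of the category members."""
    n = len(covs)
    return [sum(col) / n for col in zip(*covs)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query_covs, proto, max_distance):
    """Indices of images within max_distance of the prototype, nearest first."""
    dists = [(euclidean(c, proto), i) for i, c in enumerate(query_covs)]
    return [i for d, i in sorted(dists) if d <= max_distance]

# Toy COVs over three concepts (sky, foliage, sand) for "forest" training images
forest_train = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.0]]
forest_proto = prototype(forest_train)

queries = [[0.1, 0.85, 0.05], [0.3, 0.1, 0.6], [0.2, 0.7, 0.1]]
strict = retrieve(queries, forest_proto, max_distance=0.1)
loose = retrieve(queries, forest_proto, max_distance=0.3)
```

Raising the accepted distance retrieves more images, which typically trades precision for recall.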



SVM approach to scene retrieval. For the SVM-based retrieval of natural scenes we 
employ the LIBSVM package of Chang and Lin [13]. A Support Vector Machine is trained 
for each scene category. The input to the SVM are the concept occurrence vectors of the 
relevant images. The margin, that is the distance of an unseen concept occurrence vector 
to the separating hyperplane, is a measure of confidence for the category membership 
of the respective image. By varying the acceptance threshold for the margin, precision 
and recall of the scene categories can be controlled. 

[Figure panels: (a) COV based on annotated image regions; (b) COV based on classified image regions; (c) based on low-level 156-bin feature vector; (d) based on low-level 468-bin feature vector] 

Fig. 3. Scene retrieval with Prototype approach 

[Figure panels: (a) COV based on annotated image regions; (b) COV based on classified image regions; (c) based on low-level 156-bin feature vector; (d) based on low-level 468-bin feature vector] 

Fig. 4. Scene retrieval with SVM approach 



4 Scene Retrieval: Experiments 

Using the database described in Section 2.2, we conducted a set of experiments in 
order to compare the performance of the two retrieval implementations. In addition, it is 
evaluated whether the semantic modeling approach is superior to using low-level features 
of the images directly for retrieval. Performance measures are precision (percentage of 
retrieved images that are also relevant) and recall (percentage of relevant images that 
are retrieved). The precision-recall curves of the experiments are depicted in Fig. 3 for 
the Prototype approach and Fig. 4 for the SVM approach. Tables 2 and 3 summarize 
the Equal Error Rates (EER) of the experiments. Both concept classification and scene 
retrieval experiments are 10-fold cross-validated on the same ten test and training sets. 
That is, a particular training set is used to train the concept classifier, the SVM and the 
prototypes. Classification and retrieval are evaluated on the corresponding test set. 
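
The performance measures can be sketched as follows. The sketch assumes that the EER denotes the operating point at which precision equals recall, which the reported values suggest but which is not spelled out in the text. The confidence score is assumed to be, for example, the SVM margin or the negative distance to the prototype; all names are illustrative.

```python
def precision_recall_curve(scores, relevant):
    """Precision and recall after retrieving the top-k images, for k = 1..N.

    scores:   confidence per image (higher = more likely in the category).
    relevant: parallel list of booleans (ground truth).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_relevant = sum(relevant)
    curve, hits = [], 0
    for k, i in enumerate(order, start=1):
        hits += relevant[i]
        curve.append((hits / k, hits / n_relevant))  # (precision, recall)
    return curve

def equal_error_point(curve):
    """Curve point where precision and recall are closest to each other."""
    return min(curve, key=lambda pr: abs(pr[0] - pr[1]))

# Toy example: six images, three of them relevant
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
relevant = [True, True, False, True, False, False]
curve = precision_recall_curve(scores, relevant)
eer_precision, eer_recall = equal_error_point(curve)
```

Since precision is hits/k and recall is hits/(number of relevant images), the two coincide when exactly as many images are retrieved as there are relevant ones.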



Retrieval based on annotated image regions. In the first experiment, we compared 
the performance of the Prototype vs. the SVM approach based on annotated patches. 
The goal of the experiment is to evaluate if the semantic modeling approach is effective 
given perfect data. 

The results of the experiment are depicted in Fig. 3 (a) and Fig. 4 (a). The SVM 
approach outperforms the Prototype approach in 4 of 6 cases (Tables 2 and 3). Obviously, 
coasts and rivers/lakes are the most difficult categories. In fact, the detailed analysis 
of the retrieval results of those two categories shows that they are frequently confused. 
The main reason is that these two categories are in fact quite ambiguous. Even for the 
human annotator it is not clear into which category to sort a certain image that contains 
some water. It is especially those ambiguous images that are also wrongly retrieved by 
the retrieval system. 

Table 2. Equal Error Rates for Prototype approach 

Retrieval based on     coasts  rivers/lakes  forests  plains  mountains  sky/clouds
annotated regions       64.3%         61.9%    95.1%   79.7%      86.0%       97.1%
classified regions      57.3%         43.0%    74.8%   28.9%      66.9%       85.2%
156-bin feature vec.    29.4%         41.8%    45.6%   11.6%      33.8%        2.9%
468-bin feature vec.    25.7%         32.2%    50.5%   11.2%      32.5%        8.8%

Table 3. Equal Error Rates for SVM approach 

Retrieval based on     coasts  rivers/lakes  forests  plains  mountains  sky/clouds
annotated regions       70.5%         47.4%    91.4%   81.3%      89.3%       97.2%
classified regions      61.0%         42.1%    80.6%   54.7%      78.1%       85.3%
156-bin feature vec.    56.6%         40.3%    77.6%   46.1%      59.0%       52.9%
468-bin feature vec.    57.3%         47.3%    81.4%   54.6%      63.8%       70.5%

The SVM implementation has difficulties in modeling the rivers/lakes-category 
for small recall values since this category is not compact in the COV space. All other 
categories, that is plains, mountains, forests and sky/clouds, are retrieved with 
good to very good accuracy. Again, the analysis of the retrieval results shows that wrongly 
retrieved images are often semantically closer to the category that has been requested 
than to the “correct” category. 



Retrieval based on classified image regions. In the next experiment, images with 
automatically classified local regions were considered. The concept classifier described in 
Section 3.1 and Table 1 was employed for the Stage I classification. Based on these 
classifications, the concept occurrence vector is determined. The retrieval results are depicted 
in Fig. 3 (b) and Fig. 4 (b). Here, the SVM approach again outperforms the Prototype 
approach, now in 5 of 6 cases (Tables 2 and 3). The categories sky/clouds, mountains and 
forests have been retrieved especially well by the SVM. The loss compared to the annotated 
scenes is quite low. Compared to the retrieval in the annotated case, coasts are retrieved 
reasonably well. 

The Prototype approach fails completely to retrieve plains, whereas the SVM is 
able to achieve an EER of 54.7%. The reason for the generally worse performance in the 
plains-category lies in the confusions of the concept classification stage. The 
plains-category can be discriminated by the detection of field, grass and flowers. These three 
concepts are confused to a large percentage with rocks and foliage (refer to Table 1). 
These strong misclassifications lead to the observed low retrieval performance. 

Retrieval without semantic modeling step. The last two experiments were carried 
out in order to find out whether the semantic modeling step is in fact beneficial for the 
retrieval task. We therefore compare the 
retrieval results based on the concept occurrence vector with the performance obtained using the 
low-level features directly as image representation. The same features as for the concept 
classification were used for the image representation: a concatenation of an 84-bin linear 
HSI color histogram and a 72-bin edge direction histogram. These features were 
computed once directly on the whole image, resulting in a global feature vector of length 156, and 
once on three image areas (top/middle/bottom), resulting in a feature vector of length 
3*156=468. The “Prototype” approach now refers to the learning of a mean vector per 
category and the computation of the Euclidean distance between the mean vector and 
the feature vector of a new image. The results of these experiments are depicted in Fig. 
3 (c)-(d) for the “Prototype” approach and Fig. 4 (c)-(d) for the SVM method. 

Both the figures and the EERs in Table 2 clearly show that the “Prototype” approach 
based on low-level features fails compared to the semantic modeling based approach, 
both for one image area and for three image areas. Probably the feature space is too 
high-dimensional and too sparse. For the same reason, the introduction of more 
localized information through the use of three image areas does not bring any improvement 
compared to one image area. 

In contrast, the low-level feature-based SVM approach performs surprisingly well 
compared to the SVM based on the semantic modeling step. The introduction of localized 
information by using three image areas also leads to a performance increase. The 
variation of the EER in the three-area feature-based approach is smaller than in the approach 
based on the COV. Categories such as sky/clouds or mountains are not retrieved as 
well as with the semantic modeling approach, and categories such as rivers/lakes 
are retrieved better than with the semantic modeling approach. But in summary, the 
performance increase in the rivers/lakes-category does not counterbalance the 
performance decrease in the sky/clouds- and mountains-category. 



Discussion. Summarizing, we can draw two conclusions from the experiments. Firstly, 
the SVM implementation of the retrieval system outperforms the Prototype approach. 
Only single categories are retrieved better when using prototypes. Here, a combination 
of both methods might be advantageous. 

Secondly, the semantic modeling step and an approach such as the concept occur- 
rence vectors is beneficial for the retrieval of natural scene categories considered in this 
paper. For most categories, the EER obtained with the semantic modeling step is equal to 
or better than without the semantic modeling. Many of the wrongly retrieved images are 
in fact content-wise on the borderline between two categories. For that reason quantita- 
tive retrieval performance should not be the only performance measure for the semantic 
retrieval task. Still, the performance of the problematic categories rivers/lakes and 
plains can be improved by better concept classifiers in order to retrieve discriminant 
concepts with high confidence or better category models. One might, for example, em- 
ploy different numbers of discriminant concepts and/or image areas per category in order 
to differentiate between rivers/lakes and coasts. 

5 Conclusion 

In this work, we presented an approach to natural scene retrieval that is based on a 
semantic modeling step. This step generates a so-called concept-occurrence vector that 
models the distribution of local semantic concepts in the image. Based on this represen- 
tation, scene categories are retrieved. We have shown quantitatively that Support Vector 
Machines in most cases perform better than the retrieval based on category prototypes. 
We have also demonstrated that the semantic modeling step is superior to retrieval based 
on low-level features computed directly on the image. In addition, since ground-truth is 
more easily available for local semantic concepts than for full images, the system based 
on semantic modeling is more easily extendable to more scene categories and also to 
more local concepts. Further advantages of the semantic modeling are the data reduction 
due to the use of concept occurrence vectors and the fact that the local semantic concepts 
can be used as a descriptive vocabulary in a subsequent relevance feedback step. 

and Science (BBW 00.0617). 

Abstract. Since the number of digital multimedia libraries is growing rapidly, the 
need to efficiently index, browse and retrieve this information is also increasing. 
In this context, text appearing in images represents an important entity for in- 
dexing and retrieval purposes. Often, text is superimposed over complex image 
background and its recognition by a commercial optical character recognition 
(OCR) engine is difficult. Thus, there is the need for a text segmentation process, 
including background removal and binarization, in order to achieve a satisfactory 
recognition rate by OCR. In this paper, an unsupervised learning method for text 
segmentation in images with complex backgrounds is presented. First, the color 
of the text and background is determined based on a color quantizer. Then, the 
pixel color and the standard deviation of the wavelet transformed image are used 
to distinguish between text and non-text pixels. To classify pixels into text and 
background, a slightly modified k-means algorithm is applied which is used to 
produce a binarized text image. The segmentation result is fed into a commercial 
OCR software to investigate the segmentation quality. The performance of our 
approach is demonstrated by presenting experimental results for a set of video 
frames. 



1 Introduction 

Content-based image and video retrieval has attracted a lot of attention in recent years. 
Several approaches have been developed for indexing, querying and retrieving multime- 
dia information. One possibility is to use the text embedded in images and video frames. 
Such kind of text offers important information for image and video understanding and 
is a very good entity for keyword-based queries. 

In general, text appearing in images can be classified into two groups: scene text and 
artificial text [7]. Scene text is part of the image and does not represent any information 
about the image content (e.g. traffic signs in an outdoor scene), whereas artificial text 
is laid over the image in a later stage (e.g. the name of somebody during an interview). 
Artificial text is usually a good key to index images or videos. To obtain such useful 
indexing data, an Optical Character Recognition (OCR) system must be used. If the 
original images are directly fed into an OCR system, the OCR results are often not 
sufficient due to the complexity of the image backgrounds. Text segmentation is aimed 
at simplifying or removing the background and thus increasing the text quality in order 
to achieve good results with OCR systems. 

In this paper, the problem of text segmentation is addressed and an unsupervised 
learning method for text segmentation in images with complex backgrounds is presented. 
The problem of locating text in an image is not discussed here; our approach to this 
problem is presented in [3, 4]. The proposed text segmentation approach works as follows. 
First, the possible color of the text and background is determined using an appropriate 
vector color quantizer. Then, the pixel color and the standard deviation of the wavelet 
coefficients are used to distinguish between text and non-text pixels. A slightly modified 
k-means algorithm is used to classify text pixels in the image. This classification 
information is used to produce a binarized text image with black text on white background. 
This text image is then passed to an OCR system. We have used a standard commercial 
OCR software system to investigate the impact of our segmentation approach on 
recognition performance. The very good performance of our approach is demonstrated 
by presenting experimental results for a set of images. 

The paper is organized as follows. Section 2 gives a brief overview of related work in 
the field. Section 3 presents the individual steps of our approach to text segmentation in 
detail. Section 4 describes the experimental results obtained for a set of images. Section 
5 concludes the paper and outlines areas for future research. 

2 Related Work 



Agnihotri and Dimitrova [1] use the average of pixel values of the text image as a 
threshold for the binarization step. The authors assume that the average of pixel contours 
of the text box is closer to the average of the pixels marked as background on the text 
image. The problem of text embedded in complex background is not addressed, and 
performance results for segmentation or recognition are not reported. 

Antani et al. in [2] have presented a simple text segmentation method. To agree with 
the possible polarity of text in an image, two segmented text regions are generated. A 
connected-component method is applied to the segmented result to remove the compo- 
nents that do not fulfill the specified aspect ratios. Finally, a score is assigned to each 
of the segmented images based on their text-like characteristics. The image with the 
highest score is selected as the input for an OCR system. No performance results for 
segmentation or recognition are presented. 

Wolf et al. [16] have proposed a text localization, enhancement and binarization 
method for multimedia documents. The detected text boxes in multiple consecutive 
frames are used to create a high resolution text box using bilinear interpolation. Then, a 
combination of the classic thresholding algorithm presented in Niblack [10] and Sauvola 
[14] is used. During the binarization, a local threshold is calculated for each block 
separately. The method is tested on 60000 frames of different MPEG videos. The authors 
report to have achieved a character recognition rate of 85%. 

In an approach proposed by Lienhart and Wernicke [7], the possible text and 
background color is estimated first. A 4(8)-neighborhood seed filling algorithm is applied to 
each text (background) pixel separately. Components that do not fulfill certain 
geometrical restrictions are removed. A binarization process follows where the global threshold 
is calculated as the mean of the text and background color. The threshold procedure is 
applied differently for inverted and normal text. A recognition performance of 69.9% is 
given. The video test set used contains credits, commercial and news sequences. 
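
The global thresholding idea described above can be sketched for a grayscale image as follows. This is a simplified illustration and not the implementation of [7]: the threshold is the mean of an estimated text and background gray level, and the comparison is flipped for inverted (bright-on-dark) text so that the output is always black text on a white background. The function name and the toy values are assumptions.

```python
def binarize_global(gray, text_level, background_level):
    """Binarize a grayscale image (list of rows) with one global threshold.

    Returns an image with text pixels set to 0 (black) and background
    pixels set to 255 (white), for both normal and inverted text.
    """
    threshold = (text_level + background_level) / 2
    text_is_dark = text_level < background_level
    return [
        [0 if (p < threshold) == text_is_dark else 255 for p in row]
        for row in gray
    ]

# Normal text: dark characters (about 20) on a bright background (about 200)
normal = binarize_global([[30, 190], [210, 15]], text_level=20, background_level=200)

# Inverted text: bright characters (about 220) on a dark background (about 40)
inverted = binarize_global([[210, 50], [35, 230]], text_level=220, background_level=40)
```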

Wu et al. [17] use a low pass Gaussian filter to smooth the image and compute 
an intensity histogram. Then, the histogram is also smoothed and the first peak from 
the left on the smoothed histogram is used as a threshold for the binarization process. 
The algorithm was tested on a set of photographs, newspapers, advertisements, personal 
checks (with 300 dpi). A character recognition rate of 84% is reported. 

Loprestie and Zhou [8] have integrated the process of text segmentation into the 
recognition process. Two different methods are proposed for this purpose. The first one 
uses a polynomial surface fitting algorithm to recognize the characters. The second 
method is based on a fuzzy n-tuple classifier. Their methods are tested on a set of web 
pages. For the first method, a recognition rate of 69.7%, and for the second method, a 
recognition rate of 89.3% is reported. The two methods are trained with half of the test 
set. 

Sato et al. [13] employ a sub-pixel interpolation method and a multi-frame integration 
schema to enhance the text image. Then, the extraction of characters is done through the 
combination of four filters. At the end, the image is binarized using a global threshold. 
The recognition rate using their own OCR is 83.5% for a CNN headline news test set. 

Li et al. [6] first enhance the image resolution using Shannon up-sampling. Then, an 
adaptive threshold is used for binarizing the image. A block is marked as background 
only if its standard deviation is smaller than a fixed threshold. The recognition rate 
achieved for the test set used (images with low resolution) is 67.8%. 

Odobez and Chen [11] have presented a multi-hypotheses approach based on a
Markov random field (MRF) and on grayscale consistency constraints for text segmen-
tation. The grey level distribution in text images is modeled as a mixture of Gaussian
distributions. The assignment of each pixel to one of the Gaussian layers is based on
prior contextual information, which is modeled by an MRF. Each layer is considered as a
binary text image and is fed to the OCR as one segmentation hypothesis. The text image 
which gives the best recognition performance is considered as the output of the system. 
Finally, a simple evaluation method is applied to estimate the results of the OCR. The 
authors have reported a recognition rate of 93%. In their experiments, frames extracted 
from sports videos were used. 

Hua et al. [5] have proposed a multiple frame text extraction schema. The frames
where the same text appears in a clearly recognizable manner are averaged with each
other to get a so-called “man-made” frame. A block-based adaptive thresholding
procedure applied to this frame concludes the segmentation process. They have reported a
character recognition rate of 78% for a test set of MTV sequences.

Miene et al. [9] have presented a segmentation approach which consists of a region- 
growing algorithm for color segmentation and a method for separating text from the 
background based on geometrical constraints of characters. A character recognition rate 
of 81% is reported for a test set of videos from broadcast news and magazines. 

3 Unsupervised Text Segmentation

The approach proposed in this paper is designed to segment horizontally aligned text in 
images of arbitrary font, size and color. The system input is an image and the coordinates 
of the text bounding boxes in the image. The text bounding boxes can be automatically 
computed, e.g. by using a text localization algorithm [3, 4] (the description of this 
algorithm is beyond the scope of this paper). In contrast to many other approaches, the 
use of global thresholds [1, 6, 7, 13] or local thresholds [5, 10, 14, 16, 17] is avoided in 
our approach by applying unsupervised clustering. Furthermore, the feature vector not 
only consists of color information, but also of wavelet coefficients in order to consider 
local text-specific characteristics. The segmentation approach can be divided into four 
main steps, which are described in detail below: 

1. Resolution Enhancement; 

2. Text Color Estimation; 

3. Feature Selection and Normalization: Color and Wavelet Coefficients;

4. Classification of Pixels. 

3.1 Resolution Enhancement 

Most commercial OCR systems perform best if the image resolution is at least about
300 dpi. Furthermore, the subsequent segmentation steps also perform better if the
superimposed text is not too small. Thus, if the input image has a lower
resolution, it is rescaled up to 300 dpi by cubic interpolation.

3.2 Text Color Estimation 

In this step, the dominant text color is estimated for each text box, following an
approach suggested in [7]. First, the number of colors in the text box is reduced to the
nr_color most dominant colors using a color quantization method [18]. The number
of these colors can be set as a parameter. Then, two color histograms are calculated:
the histogram of the nr_text_rows center rows in the text box and the histogram of the
nr_backgr_rows rows directly above and below the text box. Finally, the difference
histogram of those two histograms is calculated. The text and the background
color are defined as the maximum and the minimum of this difference histogram.
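
The histogram comparison described above can be illustrated with a minimal sketch. It assumes that the text box has already been quantized, so that every pixel holds a color index, and that the rows above and below the box are passed in separately. The function name and the data layout are hypothetical and are not taken from [7]; the histograms are raw pixel counts, which is only meaningful if both regions contain a similar number of pixels.

```python
from collections import Counter


def estimate_colors(text_box, rows_above, rows_below, nr_text_rows=4):
    """Estimate (text_color, background_color) as quantized color indices.

    text_box, rows_above, rows_below: lists of rows, each row a list of
    quantized color indices (hypothetical layout, assumed for this sketch).
    """
    # Histogram of the nr_text_rows center rows of the text box.
    start = max(0, (len(text_box) - nr_text_rows) // 2)
    center_rows = text_box[start:start + nr_text_rows]
    text_hist = Counter(c for row in center_rows for c in row)

    # Histogram of the rows directly above and below the text box.
    backgr_hist = Counter(c for row in rows_above + rows_below for c in row)

    # Difference histogram: positive where a color is more frequent inside
    # the box, negative where it is more frequent around the box.
    colors = set(text_hist) | set(backgr_hist)
    diff = {c: text_hist[c] - backgr_hist[c] for c in colors}

    text_color = max(diff, key=diff.get)
    background_color = min(diff, key=diff.get)
    return text_color, background_color
```

With four center rows and two rows each above and below the box, both histograms cover the same number of rows, so the raw counts can be compared directly.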

3.3 Feature Selection and Normalization: Color and Wavelet Coefficients 

We have investigated several features in order to find the best ones to classify pixels as
text or background. The basic feature vector consists of the red, green, and blue pixel
color components, scaled to the range [0, 1]. Furthermore, a small sliding window (e.g.
3 x 3 pixels) is moved over the text box to consider local image properties. This technique
is motivated by two observations. First, characters usually have a unique texture. Second,
the border of superimposed characters results in high-contrast edges. Consequently, we
apply the wavelet transform to the image to capture these properties and pass them to
the subsequent clustering algorithm. The standard deviation of wavelet coefficients in
the sliding window is expected to be low within a character’s texture, but high at its
boundary. Thus, the character boundaries are enhanced in the segmented image by using
this feature.

The general purpose of the wavelet transform is to decompose a signal into subbands 
at various scales and frequencies. The wavelet transform can be implemented using filter 
banks consisting of high-pass and low-pass filters. The application to an image consists 
of a filtering process in horizontal direction and a subsequent filtering process in vertical 
direction. For example, when applying a 2-channel filter bank (L: low-pass filter, H: 
high-pass filter), four sub-bands are obtained after filtering: LL, HL, LH and HH. The
three high-frequency sub-bands (HL, LH, HH) strengthen edges in horizontal, vertical 
or diagonal direction, respectively. The wavelet coefficients of these sub-bands are used 
as features in our approach. Since the text-to-background contrast is expected to be high 
in the grey-scale transformed image but not in each color channel, we decided to apply 
the wavelet transform to the grey-scaled version of the image. In our approach, we have 
chosen a 5/3 wavelet filter bank evaluated in [15]. The standard deviation of wavelet 
coefficients in a sliding window at position (x, y) is defined as follows: 



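$$\mathrm{stdev}_{Window}(x, y) = \sqrt{\frac{1}{|W|} \sum_{(i,j) \in W} \bigl( I(x+i, y+j) - \mathrm{mean}_{block}(x, y) \bigr)^2}$$

where I(x + i, y + j) is the wavelet coefficient at pixel position (x + i, y + j), W is
the set of offsets (i, j) covered by the sliding window, and mean_block(x, y) is defined as:

$$\mathrm{mean}_{block}(x, y) = \frac{1}{|W|} \sum_{(i,j) \in W} I(x+i, y+j)$$
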

Except for the color components, which are normalized as described before, the
feature vector components are normalized as follows. If the lower bound of the original
feature range is smaller than 0, i.e. the range is [-m, n], then the range for the features
is shifted by m to [0, m+n]. Then, for a given image, the maximum max is computed for this
feature, and the corresponding component in each feature vector is divided by max to
normalize it to the range [0, 1].
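
The sliding-window standard deviation and the normalization step can be sketched as follows. This is only an illustration under stated assumptions: the population form of the standard deviation (division by the number of window pixels) is assumed, border pixels use the part of the window that lies inside the image, and the maximum is taken after the shift. The function names are hypothetical.

```python
import math


def window_stdev(coeffs, x, y, half=1):
    """Standard deviation of the coefficients in the (2*half+1) x (2*half+1)
    window centered at column x, row y. At the image border only the part of
    the window inside the image is used (an assumption of this sketch)."""
    height, width = len(coeffs), len(coeffs[0])
    values = [
        coeffs[yy][xx]
        for yy in range(max(0, y - half), min(height, y + half + 1))
        for xx in range(max(0, x - half), min(width, x + half + 1))
    ]
    mean_block = sum(values) / len(values)
    variance = sum((v - mean_block) ** 2 for v in values) / len(values)
    return math.sqrt(variance)


def normalize_feature(values, lower_bound):
    """Shift a feature with range [-m, n] to [0, m+n], then divide by the
    maximum found in the image so that the result lies in [0, 1]."""
    shift = -lower_bound if lower_bound < 0 else 0.0
    shifted = [v + shift for v in values]
    peak = max(shifted)
    # A feature that is zero everywhere stays zero instead of dividing by 0.
    return [v / peak if peak > 0 else 0.0 for v in shifted]
```

For a 3 x 3 window, `half=1` covers the offsets -1, 0 and 1 in both directions.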



3.4 Classification of Pixels 

The k-means algorithm is a popular and well-known algorithm to partition data into k 
clusters. The number of clusters k is assumed to be known in advance. In our case, there 
are two clearly distinguishable classes of pixels: “text" and “background". 

The “text” cluster is initialized with the feature vector f_text that has the minimum
Euclidean distance to the ideal feature vector f representing the “text” cluster, while
the “background” cluster is initialized with the feature vector f_background that has the
maximum Euclidean distance to f. For the “text” (“background”) cluster, the ideal feature
vector includes the predefined “text” (“background”) color and the normalized wavelet
coefficients (or their standard deviation in the sliding window), which are 1 (0)
in the ideal (non-ideal) case. Then, the k-means algorithm is applied to obtain clusters
whose members have the minimum Euclidean distance to the respective cluster mean
feature vector. Finally, a segmented text box is generated where the pixels of the
“text” cluster are painted black, while the other pixels are painted white. Thus, we obtain
a binary output image in which black text appears on a white background.
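
The seeded two-cluster procedure can be sketched as shown below. This is a plain k-means loop with k = 2 and the initialization described above, not the authors' implementation; the function names, the iteration limit and the tie-breaking rule (ties go to the “text” cluster) are assumptions of this sketch.

```python
import math


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def segment(features, ideal_text, max_iter=20):
    """Return one boolean per feature vector (True = text pixel)."""
    # Seed the text cluster with the vector closest to the ideal text vector
    # and the background cluster with the vector farthest from it.
    text_center = min(features, key=lambda f: distance(f, ideal_text))
    back_center = max(features, key=lambda f: distance(f, ideal_text))

    labels = None
    for _ in range(max_iter):
        new_labels = [
            distance(f, text_center) <= distance(f, back_center)
            for f in features
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        text_members = [f for f, is_text in zip(features, labels) if is_text]
        back_members = [f for f, is_text in zip(features, labels) if not is_text]
        if text_members:
            text_center = mean_vector(text_members)
        if back_members:
            back_center = mean_vector(back_members)
    return labels
```

Pixels labeled True would then be painted black and all others white to obtain the binary text image.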



4 Experimental Results and Discussion 

We have tested our text segmentation algorithm on various types of images. The test set 
consists of 20 images and covers a wide variety of background complexity and text type. 
There are 205 words and 1404 characters in those 20 images in total. 

To evaluate the performance of the proposed text segmentation algorithm, character 
recognition experiments have been conducted. The recognition rate is used as an objective 
measure of the algorithm’s segmentation performance. We have used a demo version 
of the commercial OCR software ABBYY FineReader 7.0 Professional for recognition 
purposes. After segmentation, the segmented binary text image was fed manually to the 
OCR. 

The following parameters were used for the estimation of text and background color:
nr_color = 6, nr_text_rows = 4, nr_backgr_rows = 2*2. The sliding window size
was set to 3 x 3 pixels. No assumptions were made about the resolution of the input
images. The wavelet 5/3 filter bank evaluated in [15] was used with the low-pass filter
coefficients (-0.176777, 0.353535, 1.06066, 0.353535, -0.176777) and the high-pass
filter coefficients (0.353535, -0.707107, 0.353535).
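
To make the use of these coefficients concrete, the following sketch applies the two filters separably to a grey-scale image and returns the three high-frequency sub-bands. Several points are assumptions of this sketch and are not stated in the text: no downsampling is performed (so every pixel keeps a coefficient), samples are mirrored at the image border, the first letter of a sub-band name refers to the horizontal filter, and the image is at least 3 pixels wide and high. The coefficients are copied as printed above.

```python
LOW_PASS = (-0.176777, 0.353535, 1.06066, 0.353535, -0.176777)
HIGH_PASS = (0.353535, -0.707107, 0.353535)


def filter_1d(signal, taps):
    """Centered FIR filtering of a 1-D signal with mirrored borders.
    Requires len(signal) >= 3 for the tap lengths used here."""
    n, half = len(signal), len(taps) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for k, tap in enumerate(taps):
            j = i + k - half
            if j < 0:
                j = -j                # mirror at the left/top border
            if j >= n:
                j = 2 * (n - 1) - j   # mirror at the right/bottom border
            acc += tap * signal[j]
        out.append(acc)
    return out


def filter_rows(image, taps):
    """Filter every row, i.e. along the horizontal direction."""
    return [filter_1d(row, taps) for row in image]


def filter_cols(image, taps):
    """Filter every column, i.e. along the vertical direction."""
    columns = [filter_1d(list(col), taps) for col in zip(*image)]
    return [list(row) for row in zip(*columns)]


def high_frequency_subbands(gray):
    """Return the HL, LH and HH coefficient images of a grey-scale image."""
    low_h = filter_rows(gray, LOW_PASS)
    high_h = filter_rows(gray, HIGH_PASS)
    return {
        "HL": filter_cols(high_h, LOW_PASS),   # high-pass rows, low-pass columns
        "LH": filter_cols(low_h, HIGH_PASS),   # low-pass rows, high-pass columns
        "HH": filter_cols(high_h, HIGH_PASS),  # high-pass in both directions
    }
```

Each returned image has the same size as the input, so its coefficients can be used directly as per-pixel features or fed to a sliding-window statistic.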

To estimate the best feature vector as well as to test the impact of subsequent
resolution enhancement on character recognition, a first experiment was conducted with
the original image resolution (72 dpi). Several compositions of the feature vector were
investigated in this first stage:

1. RGB Color Components; 

2. RGB Color Components + Wavelet Coefficients; 

3. RGB Color Components + Standard Deviation of Wavelet Coefficients; 

4. RGB Color Components + Wavelet Coeff. + Standard Dev. of Wavelet Coeff. 



Table 1. The recognition performance after text segmentation using various feature vectors and 
compared with the original OCR results. 



Feature Vector                                   Char. recogn.     Word recogn.
OCR on Original Image (72 dpi)                   49.1%             24.9%
Color (72 dpi)                                   65.5%             40.5%
Color + Wavelet Coef. (72 dpi)                   66.6%             42.9%
Color + Wavelet + StdDev. Wavelet C. (72 dpi)    70.9%             43.4%
Color + StdDev. Wavelet Coef. (72 dpi)           70.7% (+21.6%)    46.8% (+21.9%)
OCR on Original Image (300 dpi)                  [illegible]       34.1%
Color + StdDev. Wavelet Coef. (300 dpi)          [illegible]       65.4% (+31.3%)


The OCR results are shown in Table 1. The best overall result considering both
character and word recognition for the 72 dpi resolution images was achieved for the
feature vector consisting of R, G and B color components and the standard deviation
of wavelet coefficients within the sliding window (70.7% for character recognition and
46.8% for word recognition). Furthermore, we applied the text segmentation algorithm
using this feature vector to a resolution-enhanced image (up to about 300 dpi). As a result,
after segmenting this high resolution image, the character recognition rate increased to
85.9% while 65.4% of the words were recognized correctly, as also shown in Table
1. Some examples of our text segmentation algorithm and the recognition results are
presented in Figure 1.

We agree with Wolf et al., who remarked in [16] that a comparison with other
approaches is very difficult, if not impossible, due to the lack of a common test base. For
example, in [11] as well as in [16] the classical binarization method of Otsu [12] was
implemented and compared with the performance of the authors’ own approach. In [16],
the authors’ implementation of Otsu’s method led to a low character recognition rate
of 47.3%, and their own approach achieved 85%. In [11], the implementation of Otsu’s
method obtained a character recognition rate of 88%, while the authors reported 93%
for their own approach. Clearly, it is hard to conclude which approach performed better.

We believe that the reported performance results for our approach are at least
competitive with the best results reported by other researchers. Although our test set is
relatively small, it consists of various low resolution images, some of which have very
complex backgrounds, and includes various font types and sizes. Loprestie and Zhou [8]
achieved a better result (89.3%), but during the recognition process similar characters
(e.g. “c”, “e”) were considered as one class. Sato et al. [13] reported 83.5%, but only
frames from CNN news videos were investigated, containing only two different font types.
Wolf et al. [16] achieved a character recognition rate of 85.4% for a test set of video
frames containing 3519 characters. A comparison with [16] is difficult, too, since multiple
consecutive frames were used in their approach to build a high-resolution text box.



5 Conclusions 

In this paper, we have presented an unsupervised algorithm for text segmentation in im-
ages with complex background. First, the text color is determined using a color quantizer
and line histograms. Then, the R, G, and B color components and the three high-frequency 
wavelet coefficients are used as the features for the subsequent classification into text 
and background pixels. The classification is done by a slightly modified k-means algo- 
rithm. Several possible feature vector compositions were investigated on a test set of 
images, consisting mostly of single video frames with complex backgrounds and differ- 
ent font types. The best results were achieved if the resolution of the original image was 
increased from 72 dpi up to 300 dpi and then a feature vector including the color com- 
ponents R, G, and B, and the standard deviation of wavelet coefficients within a small 
sliding window was used. In this test case, 85.6% of the characters and 65.4% of the 
words were recognized correctly. The word recognition is nearly two times higher than 
the word recognition on the original images (34.1%), while the character recognition is 
about 20% higher (65.1% on the original images). 

Fig. 1. The text segmentation and recognition results: (a) The original images; (b) The result of
our text segmentation and binarization algorithm; (c) The OCR results of both segmented images
using ABBYY FineReader 7.0 Professional.



There are several areas for further research. The extension of the proposed text 
segmentation approach to videos instead of images will be considered in the future. 
Furthermore, the integration of a freely available OCR system will be investigated to 
support the whole processing chain from the input image to the ASCII-text at the end. 
The implementation of a complex system for automatic indexing of images and videos
and their content-based retrieval will also be investigated.

Abstract. This paper presents the content adaptation application for mobile
terminals within the current status of the Deferred Time Environment (DTE) of
the DYMAS system. The DTE enables universal and personalized access to
multimedia content over fixed and mobile networks. After introducing the system
architecture, we show the differences between the functionalities for fixed
and mobile networks, and detail the development of the application targeted at
J2ME terminals. The system uses MPEG-21 for the description of sessions
(terminal and network capabilities and user preferences) and MPEG-7 for content
descriptions, which are the basis for the Annotation, Search and Browsing,
and Transcoding applications.



1 Introduction 

Universal Multimedia Access (UMA) [1] refers to the capability to access rich
multimedia content through any client terminal and any network. The development of
new wireless networks, providing multimedia capabilities, and a wide and growing
range of client terminals make the adaptation of content an important issue in future
mobile multimedia services.

Different authors have published on general issues and architectures for UMA
systems [1] [2] [3] [4] [5], but there are few papers about prototypes, test-beds,
or implementations (e.g., [6] [7]). The DYMAS project provides, besides a real-time
environment for generating metadata to enable the provision of added-value MHP
applications synchronized with content [8], an environment enabling the provision of
alternative multimedia services based on content broadcast in digital television
channels [9]. These services rely on the UMA concept and associated technologies
and standards, with a special focus on MPEG-7 and MPEG-21 (in [6] the system uses
only MPEG-21, whilst in [7] only MPEG-7 was used).

Section 2 introduces the DYMAS system; Section 3 presents the current architecture
of the Deferred Time Environment (DTE) enabling UMA functionalities within
DYMAS. Sections 4 and 5 describe, respectively, the different descriptions used and
their relation with the MPEG-7 and MPEG-21 specifications, and the functionalities
of the J2ME mobile user applications. Section 6 concludes the paper.



2 Overview of the DYMAS System 

Figure 1 depicts an overview of the DYMAS System architecture. It mainly describes 
a processing system with one information input (a DVB Transport Stream) and two in- 
formation outputs (the modified DVB-TS and audiovisual services directed to other 
alternative access networks). 

Fig. 1. Block diagram of the DYMAS System



The system relies on technology for automatic content extraction from audiovisual
information, which is highly resource-consuming and can only cope in real time
with basic low-level features. These features are the basis for on-line service
provision, that is, for the Real-Time Environment (RTE) [8]. Besides real-time added value
for interactive TV applications, the DYMAS framework also considers the provision
of Universal Multimedia Access services that do not have a real-time requirement, but
can instead be offered with some delay. This is the responsibility of the Deferred-
Time Environment (DTE).

The DYMAS system uses the framework of MPEG-7 [10], currently mainly the
Multimedia Description Schemes [11] specification, to provide description metadata
of the multimedia content. Some parts of the descriptions are generated in the RTE
and, besides their use in interactive television applications, they are stored in the
MPEG-7 database, where descriptions are enriched via manual annotation and additional
(non-real-time) automatic and supervised algorithms for feature extraction.
These enhanced descriptions are the basis for the UMA services provided by the DTE.
Additionally, the DTE makes use of MPEG-21 [12], mainly the Digital Item Adaptation
specification [13], to provide description metadata of the session (terminal and
network capabilities and user preferences) in order to perform content adaptation.



3 Current DTE Architecture 

The current DTE provides universal and personalized access to the MPEG-2 database
over the fixed and mobile networks. At the current status of the system implementation,
the fixed applications run in Web browsers on standard PCs [14] and
the mobile applications run on Java terminals [15] (support for MMS terminals will
also be available soon [16]). Both fixed and mobile applications have the same
functionalities, although there are some differences because wireless resources are
scarce.

The DTE is composed of two main subsystems: the Metadata subsystem in charge
of annotating, searching and browsing, and the Transcoding subsystem in charge of 
performing content adaptation to each particular session. 

3.1 Metadata Subsystem 

The Metadata subsystem (see Figure 2) provides four client terminal specific applica- 
tions (three in the case of the J2ME clients) that run over a common set of server ap- 
plications and databases. 

• The Annotation application provides an interface to edit XML descriptions 
using a set of MPEG-7 and MPEG-21 description tools. These descriptions 
can refer to the multimedia content (according to MPEG-7) and be used to 
catalogue new multimedia content, or can be descriptions of the session, 
including user preferences, terminal and network descriptions (according to 
MPEG-21 DIA, plus MPEG-21 Digital Item Declaration). There are currently
two annotation applications. For fixed terminals (PCs with web
browsers) the annotation application allows content and session annotation,
whilst for mobile terminals only session descriptions can be generated
via the corresponding session annotation application (see Section 5.1).

• The Search application provides an automatic search including (simultane- 
ously or not depending on user selection) automatic filtering based on the 
user preferences (part of the session profile) and a query driven search. 
When the search results are available, the server sends the results to the cli- 
ent terminal in the appropriate format to be processed and presented cor- 
rectly. 

[Figure 2: Metadata subsystem, showing client applications, server applications and the XML database]

• The Browsing application allows users to select content among the ob- 
tained results and to access a detailed description of the content. 

As we can see in Figure 2, the information sent to clients of fixed networks is a set 
of HTML pages, while mobile clients receive plain text that the client application in- 
terprets. 

XSL Transformations play an important role in the application architecture
[9]. In fact, the search application can be understood as a chain of transformations,
from client request (in XML/MPEG-7 format) to server response (in HTML
or plain text), including search, filtering and presentation processes. XSLT pattern
matching allows the search engine to process the XML query in a natural way,
transforming it into an output XML document with the results of the search. In the same
transformation, the filtering is performed making use of the user profile description.
Finally, the results are transformed once more to format them for the response sent
back to the user.
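
The idea of a chain of transformations can be illustrated with a small sketch. It deliberately uses plain Python functions over `xml.etree.ElementTree` instead of XSLT stylesheets, and it keeps search and filtering as separate steps for readability. The element names (`Query`, `Title`, `Results`, `Item`), the `genre` attribute and the catalogue layout are hypothetical and are not MPEG-7 or MPEG-21 structures.

```python
import xml.etree.ElementTree as ET


def search(query_xml, catalogue):
    """XML query -> XML results: keep items whose title contains the query text."""
    wanted = ET.fromstring(query_xml).findtext("Title", default="").lower()
    results = ET.Element("Results")
    for entry in catalogue:
        if wanted in entry["title"].lower():
            item = ET.SubElement(results, "Item", genre=entry["genre"])
            item.text = entry["title"]
    return results


def filter_by_profile(results, preferred_genres):
    """XML results -> XML results restricted to the user's preferred genres."""
    filtered = ET.Element("Results")
    for item in results:
        if item.get("genre") in preferred_genres:
            filtered.append(item)
    return filtered


def present(results, fixed_client):
    """XML results -> HTML list for fixed clients, plain text for mobile clients.
    HTML escaping is omitted to keep the sketch short."""
    titles = [item.text for item in results]
    if fixed_client:
        return "<ul>" + "".join(f"<li>{t}</li>" for t in titles) + "</ul>"
    return "\n".join(titles)
```

A request is then answered by composing the steps, for example `present(filter_by_profile(search(query, catalogue), genres), fixed_client=False)` for a mobile client.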

3.2 Transcoding Subsystem 

The transcoding process modifies some media characteristics (image width and height
in pixels, video bitrate, frame size, audio bitrate, sample rate, etc.) of the
MPEG video in the content database. The values of the selected parameters are
obtained from the terminal capabilities and the network characteristics described in the
MPEG-21 session profile.

The transcoded content can be sent directly to the client through streaming, as in
the application for wired PCs, or it can be stored in the server to be downloaded later
through an HTTP request by the client, as in the application for J2ME mobile
terminals. As expected, both approaches have advantages and disadvantages (real time,
delay, persistent copy for further reuse, ...).


4 Content, Session, and Query Descriptions

Within the system, we consider three descriptions: the content description, the session 
description, and the query description. 

The content description uses MPEG-7, and is based on the MPEG-7 Simple Profile [17].
The Simple Profile is based on the MPEG-7 standard but with some restricted
descriptors. Content descriptions are created using the corresponding part of
the annotation application of the fixed terminals. Although the mobile Java terminals
cannot annotate content descriptions, because they
don’t have the editor application, they use these descriptions to obtain information about
the multimedia content and to carry out searches.

The session description uses MPEG-21 DIA (Digital Item Adaptation) [13] description
tools. The session description is split into three subdescriptions (MPEG-21 compliant),
depending on the context element to be considered. A session description contains
links to a network description, a terminal description and a user description (see
Fig. 3). Session descriptions are created using the corresponding part of the Annotation
application (in the future, network and terminal will be detected automatically and
user preferences updated automatically by inspecting, if allowed, the user’s usage history).
The user description describes the client's content preferences and the client's
presentation preferences. The terminal description contains information about the
decoding, display and audio output capabilities, as well as the power, storage
and data I/O characteristics. The network description contains data transmission
characteristics, delay and error patterns, etc.

The query description uses a modified MPEG-7 profile which represents a partial
content description that the user wants to match in the database. These “queries”
are therefore also XML descriptions similar to MPEG-7 content descriptions. However, as
MPEG-7 is not designed for querying, the query description is compliant with a modified
MPEG-7 schema which includes some extra attributes (e.g., case-sensitive, just-included),
and relaxes the constraints on some description elements.



5 J2ME Client Applications 

The applications for mobile terminals have been developed with J2ME (Java 2 Micro
Edition, the specific Java platform for wireless devices),
which allows the development of small programs called MIDlets. These small applications
should be developed keeping in mind that in the wireless world network
data rates and processors are slow and memory is scarce. These constraints require
the client applications to be as small as possible.

The client applications for mobile terminals have also been designed to be as simple as
possible, so that the user does not have to navigate through too many
screens to obtain the desired content.

<?xml version="1.0" encoding="UTF-8"?>
<DIDL xmlns="urn:mpeg:mpeg21:2002:01-DIDL-NS">
  <Item>
    <Component>
      <Descriptor>
        <Statement mimeType="text/plain">UserCharacteristics</Statement>
      </Descriptor>
      <Resource mimeType="text/xml" ref="User/AnaG.xml"/>
    </Component>
    <Component>
      <Descriptor>
        <Statement mimeType="text/plain">TerminalCapabilities</Statement>
      </Descriptor>
      <Resource mimeType="text/xml" ref="Terminal/Nokia6310i.xml"/>
    </Component>
    <Component>
      <Descriptor>
        <Statement mimeType="text/plain">NetworkCharacteristics</Statement>
      </Descriptor>
      <Resource mimeType="text/xml" ref="Network/GPRS.xml"/>
    </Component>
  </Item>
</DIDL>

Ana.xml

<?xml version="1.0" encoding="UTF-8"?>
<DIA xmlns:dia="urn:mpeg:mpeg21:dia:schema:2003">
  <Description xsi:type="UserCharacteristicsType">
    <UserInfo xsi:type="mpeg7:PersonType">
    </UserInfo>
  </Description>
</DIA>

User/AnaG.xml

<?xml version="1.0" encoding="UTF-8"?>
<DIA xmlns:dia="urn:mpeg:mpeg21:dia:schema:2003">
  <Description xsi:type="TerminalCapabilitiesType">
    <Codec>.... </Codec>
  </Description>
</DIA>

Terminal/Nokia6310i.xml

<?xml version="1.0" encoding="UTF-8"?>
<DIA xmlns:dia="urn:mpeg:mpeg21:dia:schema:2003">
  <Description xsi:type="NetworkCharacteristicsType">
    <Capability maxCapacity="115000" minGuaranteed="40000"/>
  </Description>
</DIA>

Red/GPRS.xml



Fig. 3. Session description and subdescriptions 



5.1 Session Annotation Application 

The first thing the user has to do to be able to use the application is to register with a
session profile. When registering, the user specifies his/her preferences, the
terminal capabilities and the characteristics of the network that he/she uses. With this
information stored in the database, the application can obtain the values to configure the
transcoding of the multimedia content. If the user doesn't have a session profile
assigned, he/she can create one with the session annotation application, which is available
both for users of fixed networks and for users of mobile networks.

The creation of a session description consists of editing or selecting three different
descriptions (user preferences, terminal capabilities and network characteristics), as
explained above. An existing description, stored at the server, is selected
from a list. The user can inspect a description to
decide whether it matches his/her desired session profile. To edit a description, the user
enters the values of the fields and the application internally produces the XML
description using a parser. When the description has been created, it is sent to the server
for persistent storage. Each new description can be reused by future users.

5.2 Search and Browsing Applications 

The search application allows the user to create an XML query description (see
above) from the information entered in the available fields, with the option of
importing the user preferences annotated in the session profile. Once the request has
been sent to the server, it is processed and a set of results is sent back to the mobile
device.

The client application receives the results of the search in plain text, interprets them 
and generates a list. When the user selects an element of the list, all the related meta- 
data available from the content description appears on the screen. 

Fig. 4. Session annotation application

Fig. 5. Search and Browsing applications



5.3 Transcoding Application 

Before the selected video is requested, all the parameters used to
configure the transcoding process are displayed to the user. As mentioned previously,
the values of the parameters are obtained from the session profile that the user
configured when registering with the application.
These values are not final, because the user can modify them within
a range of available values depending on the particular circumstances of the running
session. This range of values is usually limited by the network characteristics or the
terminal capabilities.
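
As a small illustration of how a user-modified value could be kept inside the allowed range, the sketch below clamps a requested bitrate to the smaller of the network and terminal limits. The function name and the terminal limit are hypothetical; the network limit of 115000 corresponds to the maxCapacity value shown in the GPRS network description of Fig. 3.

```python
def clamp_to_session(requested, network_max, terminal_max, minimum=0):
    """Limit a user-requested parameter value to the range allowed by the
    session: at least `minimum`, at most what both network and terminal allow."""
    upper = min(network_max, terminal_max)
    return max(minimum, min(requested, upper))


# Example: a GPRS link (maxCapacity 115000) and a terminal assumed to
# decode up to 200000 bit/s (hypothetical value).
bitrate = clamp_to_session(150000, network_max=115000, terminal_max=200000)
```

Values inside the range are passed through unchanged, so the defaults taken from the session profile are only altered when the user asks for more than the session can support.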




Fig. 6. Transcoding Application: configure and play video 

The last step to play the transcoded video is to send the configuration
parameters for the transcoding process to the server. Transcoding is only carried out if
there is no copy with the required media characteristics available in the variations
cache database. After transcoding and storing the new variation in the cache database,
if these steps were required, the application downloads the video to the terminal and
the video starts playing automatically. The player allows the user to carry out several basic
operations with the video, for example, stop it, rewind it, switch to full screen, mute
the audio, etc.
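
The cache check described above amounts to a lookup keyed by content and media characteristics. The following sketch shows the idea with an in-memory dictionary standing in for the variations cache database; the function names and the parameter names are hypothetical, and the actual transcoder is passed in as a callable.

```python
def get_variation(cache, content_id, params, transcode):
    """Return the variation of `content_id` with the requested media
    characteristics, transcoding only if no matching copy is cached."""
    # Sorting the items makes the key independent of parameter order.
    key = (content_id, tuple(sorted(params.items())))
    if key not in cache:
        cache[key] = transcode(content_id, params)  # cache miss: transcode and store
    return cache[key]
```

A second request with the same content and the same parameters is then served from the cache without invoking the transcoder again.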



6 Conclusions 

As we have seen, users who want to access multimedia content can nowadays use
different types of networks and terminals. The main objective of the UMA concept is to
adapt this multimedia content to the network characteristics and the terminal
capabilities. The current DYMAS system integrates both a fixed access service with web
browsers and wireless access with J2ME mobile terminals.

Although mobile terminals tend to have the same functionalities as fixed terminals,
there are some differences because the network data rates and processors are slow and
memory is scarce. Both services (fixed and mobile access) include a session annotation
application to describe a session profile, and a search and browsing application.
The content annotation application, which allows cataloguing content based on the
MPEG-7 standard, is only available for the fixed access service. With the session
annotation applications the user can describe his/her preferences, the terminal
capabilities and the network characteristics with an MPEG-21 description. This information
configures the transcoding system when the search application finds the multimedia
content that the user wishes.

Preliminary testing of the currently implemented and integrated applications (web for
fixed PCs and J2ME applications) indicates that the MPEG-2 database can be accessed via
alternative networks and services that provide content adaptation to terminal
characteristics and personalization to user preferences, that is, UMA functionalities.
The main remaining problem is the requirement for “timely” annotation of content, which
is needed to provide personalization to user preferences. After final integration of the
support for MMS terminals, testing and validation of the current version of the UMA
functionalities will follow.

Abstract. This paper presents a task-based user evaluation of two content-based
image database browsing systems. The performance of the two systems is
compared to that of a commercial image database management program, which 
does not employ content-based information. Experimental results show that 
content-based cues improve the efficiency of the browsing considerably. 
Guidelines for system design are derived from the user feedback. 



1 Introduction 

The literature proposes numerous methods for assessing the usability and performance of
interactive systems such as image database search and browsing systems [10,12]:
observation, think aloud, questionnaires, interviews, focus groups, logging actual use,
user feedback, heuristic evaluation, pluralistic walkthrough, formal usability
inspection, empirical methods, cognitive walkthroughs, formal design analysis, etc.
Despite several decades of retrieval experiments, the early quantitative measures of
precision and recall are still the most widely adopted [1,4,13].

Qualitative user tests allow researchers to learn how users perceive the system’s
performance and usability. The general nature of search tasks in visual information
systems reduces the value of synthetic testing, such as [3], in real operational
environments. However, artificially generated performance numbers still have an
important role in the selection of technological alternatives for the system.

This paper presents a task-based user evaluation of two content-based image browsing
systems, IIRo [11] and PicSOM [7]. The performance of the two systems is compared to
that of the commercial ACDSee program [14]. ACDSee does not employ any content-based
methods; instead, search and management of images are based on efficient browsing of
thumbnail images.

2 Browsing Systems

2.1 IIRo

The IIRo system [11] is based on unsupervised clustering of content-based metadata
using self-organizing maps (SOM, [5]). Information visualization techniques,
multi-resolution object layers and a zoom view based on the focus+context technique [2]
are employed to improve the user's interaction with the browsing system. The user is
offered a simultaneous focused presentation of the selected object and other similar
ones, while still maintaining a view of the entire database, which keeps the user from
getting lost during browsing.

The multi-resolution index structure of the IIRo system is implemented with a self- 
organizing map, which is trained with the content-based metadata extracted from the 
objects of the database. The SOM provides topological ordering of the objects at the 
so-called root level, from which the browsing level is obtained by subsampling in 
both horizontal and vertical directions so that objects residing in the nodes within a 
given area are pooled into a single node in the browsing level. Subsampling is dy- 
namic, producing a desired browser view. 
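
The pooling of root-level nodes into browsing-level nodes can be illustrated with a
short sketch. It assumes a fixed integer subsampling factor and a dictionary-based map
layout; both are assumptions made for the example, and the dynamic subsampling used by
IIRo itself is not reproduced here.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (row, column) of a SOM node


def pool_nodes(root: Dict[Node, List[str]], factor: int) -> Dict[Node, List[str]]:
    """Pool factor x factor blocks of root-level nodes into browsing-level nodes."""
    browsing: Dict[Node, List[str]] = defaultdict(list)
    for (row, col), images in sorted(root.items()):
        # Integer division maps every node of a block onto the same pooled node.
        browsing[(row // factor, col // factor)].extend(images)
    return dict(browsing)


# Hypothetical root-level map: node position -> images residing in that node.
root_level = {
    (0, 0): ["a.jpg"],
    (0, 1): ["b.jpg", "c.jpg"],
    (1, 1): ["d.jpg"],
    (2, 3): ["e.jpg"],
}
browsing_level = pool_nodes(root_level, factor=2)
```

With a factor of 2, the three occupied nodes in the top-left 2x2 block collapse into one
browsing-level node holding four images. By the same arithmetic, a factor of 9 would map
a 90x90 root level onto a 10x10 view.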

Fig. 1. The graphical user interface of the IIRo system.



IIRo’s graphical user interface is illustrated in Fig. 1. The UI is a single window,
which is divided horizontally into two parts. The left part contains miniature
visualizations of the candidate views into the database constructed with different
metadata. In the right part are the actual browsing view and the detailed view showing
the images in the current target node.

In the browsing view each node is visualized by one of the images residing in the node.
The target image, i.e. the image visualizing the current target node (focus), is
highlighted, and the browsing view is panned so that the target node is close to the
center of the view. The target image is highlighted simultaneously in all visualizations
according to the linking+brushing metaphor, to help the user comprehend the
relationships between different visualizations.

The browsing view is used solely with a pointing device (mouse) so that the node 
of interest is selected with the right button. When a node is selected, one of the im- 
ages residing in the node is visualized as the target image. If the current target node is 
re-selected, then a different image is chosen as the target image. The user is informed 
about the number of images in a single node with a row of dots above the image se- 
lected to visualize the node. By double-clicking the target image, the user can visual- 
ize it in full size in a separate window according to the details-on-demand metaphor. 

The browsing view includes an optional zoom view based on the focus+context 
technique. With the left button of the mouse the user can fire up the zoom view, 
which provides a more detailed view of the images close to the current target image. 
To avoid confusing the user with the appearance of the zoom view, it is created as an 
animation, originating from the current target image and slowly expanding to its full 
size. The zoom view has a lighter background color to distinguish it from the brows- 
ing view. Individual images are visualized as thumbnails: 20x20 pixels in the panel of 
candidate views, 100x100 in the browsing view, and 150x150 in the center of the 
zoom view. 



2.2 PicSOM 

The PicSOM system is a framework for generic research on algorithms and methods 
for CBIR, using the self-organizing map as the basic image indexing method. A more 
detailed description of the system and results of experiments performed with it can be 
found in [7,8,9]. For computational reasons, PicSOM uses a special form of the SOM 
algorithm, the Tree Structured Self-Organizing Map (TS-SOM) [6]. The hierarchical 
structure of TS-SOM reduces the complexity of training large SOMs by exploiting 
the hierarchy in finding the best-matching map unit (BMU) for an input vector. In 
addition, the produced hierarchical representation of the image database can be util- 
ized in browsing and visualizing the images in large databases. 
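
The idea of exploiting the hierarchy in the BMU search can be sketched as follows. This
is a simplified one-dimensional illustration under stated assumptions (each level has
twice as many units as the previous one, and unit i has children 2i and 2i+1); it is not
the TS-SOM algorithm of [6] itself.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tree_bmu(levels: List[List[Vector]], x: Vector) -> int:
    """Find the BMU on the last level of a 1-D tree-structured map.

    levels[k] holds the weight vectors of level k; unit i on level k has
    children 2*i and 2*i + 1 on level k + 1 (an assumed layout).
    """
    # Exhaustive search is done only on the small top level.
    best = min(range(len(levels[0])), key=lambda i: sq_dist(levels[0][i], x))
    for units in levels[1:]:
        # Restrict the search to the children of the previous BMU and of its
        # immediate neighbours instead of scanning the whole level.
        parents = range(max(best - 1, 0), min(best + 2, len(units) // 2))
        candidates = [c for p in parents for c in (2 * p, 2 * p + 1)]
        best = min(candidates, key=lambda i: sq_dist(units[i], x))
    return best


# Three toy levels with 2, 4 and 8 evenly spaced one-dimensional units.
levels = [[[(2 * i + 1) / 2 ** (k + 2)] for i in range(2 ** (k + 1))] for k in range(3)]
bmu = tree_bmu(levels, [0.6])
```

On each lower level only a handful of candidate units is compared with the input vector,
so the cost of the search grows with the number of levels rather than with the number of
units on the largest map.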

The main image retrieval method of PicSOM is query by examples, in which the image query
is iteratively refined based on the user's relevance feedback. The relevance information
is mapped from the images to the corresponding BMUs on the SOMs in use and then spread
to the map neighborhoods. This way, one obtains areas of positive and negative
responses, which are illustrated with red and blue colors, respectively, on the user
interface. When determining the images shown on the next query round, PicSOM supports
multiple features, since the responses from the parallel SOMs are combined
automatically; only one feature was used in this study.
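
The spreading of relevance information to the map neighborhoods can be sketched as
below. The triangular kernel, the radius and the +1/-1 marks are assumptions chosen for
the illustration; the kernel actually used by PicSOM is described in [7,8,9] and is not
reproduced here.

```python
from typing import Dict, List, Tuple

Node = Tuple[int, int]  # (row, column) of a map unit


def spread_relevance(
    size: int, marks: Dict[Node, float], radius: int = 1
) -> List[List[float]]:
    """Spread relevance marks placed on BMUs to neighbouring map units.

    Uses a triangular kernel over Chebyshev distance; the kernel choice is an
    assumption made for this sketch.
    """
    response = [[0.0] * size for _ in range(size)]
    for (row, col), value in marks.items():
        for r in range(max(row - radius, 0), min(row + radius + 1, size)):
            for c in range(max(col - radius, 0), min(col + radius + 1, size)):
                dist = max(abs(r - row), abs(c - col))
                response[r][c] += value * (1.0 - dist / (radius + 1))
    return response


# One image marked relevant (BMU at (0, 0)) and one non-relevant (BMU at (3, 3)).
grid = spread_relevance(4, {(0, 0): +1.0, (3, 3): -1.0})
```

Units near the relevant image receive a positive response and units near the
non-relevant image a negative one, which corresponds to the red and blue areas on the
map; units far from both stay at zero.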

Fig. 2. The graphical user interface of the PicSOM system.

Fig. 2 shows the user interface of PicSOM during an example query for landscape 
images based on a color correlogram feature. First, the responses of the user's previ- 
ous relevance evaluations are visualized on the feature map. The currently relevant- 
marked images are shown next and the images returned as best-scoring ones on this 
query round are shown below. The checkboxes beside the images are used for mark- 
ing the relevance of the images to the current query. In addition, the user can at any 
time switch to image browsing by clicking on interesting locations on the used SOMs. 
Then, the corresponding portion of the SOM surface is displayed with a navigational 
aid for further browsing. 



3 Task-Based User Evaluation 

The performance and usability of the browsing systems were evaluated with a task-based
user evaluation. The goal of the evaluation was to quantify how well the users
are able to carry out various tasks solely based on automatic content-based methods, 
without prior manual annotation or categorization of the database. Further, the 
evaluation was expected to identify possible usability problems and provide valuable 
feedback for improvements. 

3.1 Test Arrangement

Setup. The user evaluation involved 20 test users, mostly research personnel and a
few students. The test users were required to have good computer skills and no prior
experience of either IIRo or PicSOM.

The image database used in the evaluation contained 10144 images from 150 dif- 
ferent subject matters in the CorelGALLERY collection. The subject matters were 
chosen so that the resulting database would be versatile and representative of a typical 
end user image database. Color correlograms extracted from the images were used as 
the metadata in IIRo and PicSOM. ACDSee does not utilize any content-based meth- 
ods, but visualizes the images in a random layout. 

The evaluation was conducted on a regular desktop PC, which had a 19 inch color 
monitor set to 1600x1200 pixel resolution. The graphical user interface of each sys- 
tem was set to cover the complete screen. Further, each system was adjusted to have 
the same background color, to eliminate any impact by contrast differences. The users 
had a mouse and a keyboard at their disposal. 

The IIRo system used one multi-resolution SOM, where the root level had 90x90 
nodes. The query-by-example functionality was disabled, to force the users to rely on 
the browsing method exclusively. The size of the browsing view was set to 100 im- 
ages as a 10x10 grid. The size of the thumbnail images on the browsing view was set 
to 100x100 pixels. 

The browsing view of the PicSOM system was set to 6x5 images, so that in terms 
of number of images it was roughly identical to the zoom view of the IIRo system (30 
vs 33). This number of images could also be represented in the user interface simulta- 
neously without scroll bars. The size of the thumbnail images was 120x90 pixels, the 
default setting of the PicSOM system. 

In the ACDSee system the size of the thumbnail images was set to 100x100 pixels. 
Before the start of the test all images were loaded into ACDSee so that the system
did not have to load them during the browsing. The initial placement of the images on 
the browsing view was drawn randomly for each test user. 

Test procedure. The 20 test users were first randomly divided into two groups (A and 
B) of 10 subjects. Group A evaluated first PicSOM and then IIRo, and group B vice 
versa, to eliminate the effect of learning. ACDSee was always the last system evalu- 
ated. At the beginning the test users were provided with written instructions and an 
introductory search task to familiarize them with the browsing system in question. 
Once the introductory task was completed, the actual evaluation commenced. 
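
The counterbalanced ordering described above can be written down as a small sketch. The
user identifiers and the random seed are hypothetical; only the group sizes and the
system orders follow the text.

```python
import random
from typing import Dict, List


def assign_orders(users: List[str], seed: int = 0) -> Dict[str, List[str]]:
    """Randomly split users into two equal groups with mirrored system order."""
    shuffled = users[:]
    random.Random(seed).shuffle(shuffled)  # seeded only to make the sketch repeatable
    half = len(shuffled) // 2
    orders: Dict[str, List[str]] = {}
    for user in shuffled[:half]:  # group A: PicSOM first
        orders[user] = ["PicSOM", "IIRo", "ACDSee"]
    for user in shuffled[half:]:  # group B: IIRo first
        orders[user] = ["IIRo", "PicSOM", "ACDSee"]
    return orders


orders = assign_orders(["user%02d" % i for i in range(1, 21)])
```

Each content-based system is evaluated first by exactly half of the users, so learning
effects fall on both systems equally, while ACDSee is always last.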

Each test user was asked to carry out the five tasks described below. Each task was
defined on a separate sheet of paper, which the test user was allowed to study for an
arbitrary amount of time. Once the test user reported being ready to carry out a task,
the UI of the system being evaluated was exposed to the user. A task was deemed
completed once the target image(s) had been found. The test user was also allowed to
forfeit a task. After completing the fifth and last task, the test user was asked to
fill in a paper questionnaire. Having completed the questionnaire for one system, the
test user moved on to the evaluation of the next system. The ACDSee system was evaluated
only with task T1, since completing all five tasks would have taken too much time.
After evaluating all three systems the test user filled in a second questionnaire, where
(s)he was asked to rank the three systems and explain the ranking.

Tasks. The test users were asked to carry out the following five tasks: 

T1: Find the image illustrated in Fig. 3,
T2: Find an image of night-time sky,
T3: Find five images of desert,
T4: Find an image of the Statue of Liberty against a blue sky,
T5: Find an image of a violin.




Fig. 3. Target image used in task T1.



3.2 Results 

The following measurements were recorded: 

t    search time elapsed in carrying out a single task,
V    the number of new images visualized in the UI during a single task,
T    user’s subjective assessment of the tryingness of performing a single task, on a
     scale 1=effortless ... 8=very trying,
F    the number of forfeited tasks per user,
S    user’s subjective assessment of satisfaction with the functionality of the IIRo and
     PicSOM systems, on a scale 1=very unsatisfied ... 8=very satisfied.

In the following analysis symbols $R_G^S$ refer to the results obtained with group $G$
(A or B) for system $S$ (A=ACDSee, I=IIRo, P=PicSOM). We first study $R_A^P$ and
$R_B^I$, which are regarded as the most unbiased results, since they do not include any
effect of learning by carrying out the tasks with the other system beforehand.

Fig. 4(a) shows the median task-wise search times for PicSOM and IIRo and their
average (we use the median to eliminate the effect of ‘outliers’). We see that in the
first four tasks IIRo provides 10-20 seconds shorter search times, whereas task T5 was
very difficult for IIRo. The reason for this was the task definition, which
intentionally did not include any reference to the desired color distribution. PicSOM’s
relevance feedback mechanism also proved useful in this task, resulting in a quicker
convergence of the search space.

T. Ojala et al.

Fig. 4. Assorted results (see text for explanation).



Fig. 4(b) shows the task-wise tryingness estimates and their average. They follow
roughly the same pattern as the search times, except for task T3, where, given the
search times, IIRo is judged to be surprisingly effortless relative to PicSOM. The
reason is that IIRo typically shows a number of similar images simultaneously. Hence,
having found one desired image the user is likely to have found several of them (just
as task T3 required), which compensates for the possibly long search time in finding
the first correct image.

Fig. 4(c) shows the median number of new images shown to the user in the UI in each
task and their average. We see that on average with IIRo the user is shown 140 more
images than with PicSOM. This is mostly explained by the fact that the starting screen
of IIRo has 70 more images than that of PicSOM.

Next we study results compiled over all 20 test users, i.e. results obtained for 
groups A and B are aggregated. Fig. 4(d) shows the median search times for each
task and their average. We see that in task T1 the median search time for ACDSee
was about three times longer than for IIRo or PicSOM, although the task was straight- 
forward in the sense that spotting the desired image should have been easy due to its 
distinct colors. In task T4 the difference would probably have been even greater in 
favor of the content-based browsing systems. 

The aggregate tryingness estimates in Fig. 4(e) show that IIRo and PicSOM are found
roughly equally effortless to use, whereas ACDSee is found very trying. This is also
demonstrated by the number of forfeited tasks shown in Fig. 4(f). With ACDSee, five out
of 20 test users gave up on task T1, while IIRo and PicSOM scored only three and five
forfeits in total, respectively. The poor results for ACDSee demonstrate the usefulness
of content-based information in browsing a large image database.

The learning taking place during the evaluation can be seen by comparing the median
search times of the two groups. For both systems, the group that evaluated a system
second achieved lower median search times than the group that evaluated it first. If we
look at the rankings of the three systems, in both groups the content-based system that
was evaluated first got a slightly worse total ranking than the system evaluated second.
Similarly, on average test users gave a slightly higher satisfaction estimate to the
system evaluated second (PicSOM 5.0 → 5.6, IIRo 5.2 → 5.3). Possibly, these can also be
attributed to learning. The ACDSee system was ranked last by all test users.

3.3 Guidelines for System Design 

The user evaluation and the feedback from the test users produced important sugges- 
tions for system design. Too small thumbnails (100x100 pixels on 19 inch monitor set 
to 1600x1200 resolution) were the biggest shortcoming in both IIRo and PicSOM. 
Another major problem reported by the test users was the ‘semantic gap’ between the 
color-based clustering of images by the systems and the high-level categorization of 
images the users preferred to impose. 

The following usability problems were identified for PicSOM: difficulties in finding a
suitable example image among the images in the starting screen (could be addressed by
using a larger starting screen); the user may forget to remove bad images from the
collection of query images, which results in weaker performance; poor design of the
button used for selecting an image as a good example; and no “return to beginning”
button.

The following usability problems were identified for IIRo: the zoom view occluded too
many images in the browsing view, it was difficult to recognize that a node contained
multiple images, the mouse buttons were overloaded with too many functions, and there
was no way to revert to the previous visualization.

4 Conclusion

This paper presented a thorough task-based user evaluation of two content-based 
image database retrieval systems and the commercial ACDSee program. The results 
show that content-based information is very useful in browsing a large image data- 
base. Useful guidelines for future system design were also obtained. 

Abstract. In this paper we describe ImageCLEF¹, the cross language image retrieval
track of the Cross Language Evaluation Forum (CLEF²). We instigated and ran a pilot
experiment in 2003 where participants submitted entries for an ad hoc bilingual image
retrieval task on a collection of historic photographs from St. Andrews University
Library. This was designed to simulate the situation in which users would express their
search request in natural language but require visual documents in return. For 2004 we
have extended the tasks to include a medical image retrieval task and a user-centred
evaluation.



1 Introduction 

A great deal of research is currently underway in the field of Cross Language Infor- 
mation Retrieval (CLIR) where documents written in one language are retrieved by a 
query written in another (see, e.g. [11] and [16]). One can consider CLIR as basically 
a combination of machine translation (MT) and traditional monolingual information 
retrieval (IR). Most CLIR research has focused on locating and exploiting translation
resources with which the user’s search requests or target documents (or both) are 
translated into the same language. Campaigns such as the Cross Language Evaluation 
Forum (CLEF) [16] and the Text REtrieval Conference (TREC) [20] multilingual 
track have helped encourage and promote international research, as well as create 
standardised resources for CLIR evaluation. 

However, one area of CLIR research which has received less attention is image
retrieval. In collections such as historic or stock-photographic archives, medical case
notes and art/history collections, images are accompanied by some kind of text (e.g.
metadata or captions) semantically related to the image [2][12]. Images can then be
retrieved using standard IR methods based on textual queries. However, retrieval from an
image collection has characteristics distinct from retrieval in which the document to be
retrieved is natural language text [1][10]. These include the way in which a query is
formulated, the method used for retrieval (e.g. based on low-level features derived from
an image, or associated text), the types of query, how relevance is assessed, the
involvement of the user during the search process, and fundamental cognitive differences
between the interpretation of visual versus textual media. Methods of image retrieval
are typically based on visual content³ (e.g. colour, shape, spatial layout and texture),
or on text/metadata associated with the image (see, e.g. Smeulders et al. [18] and
Goodrum [10]).

1 ImageCLEF: http://ir.shef.ac.uk/imageclef2004/
2 CLEF: http://www.clef-campaign.org

For those organisations managing image repositories in which text is associated with
images (e.g. on-line art galleries), one way to exploit these is by enabling
multilingual access to them. To promote research in this area we instigated ImageCLEF
[5] as part of the CLEF campaign. We felt this contribution would address an important
and timely problem not dealt with by existing cross language evaluation. We envisage
that ImageCLEF will appeal to both commercial and academic research communities,
including cross language information retrieval, image retrieval, and user interaction.
The main aims of the ImageCLEF campaign are: (1) to promote and initiate international
research for CL image retrieval, (2) to further our understanding of the relationships
between CL texts and images for IR, and (3) to create a set of useful standardised
resources for CL image retrieval for the scientific community as a whole.

The paper is organised as follows: in section 2 we describe the ImageCLEF 2003 test
collection for an ad hoc retrieval task, in section 3 we describe the tasks offered in
ImageCLEF 2004, and finally in section 4 we summarise the contents of this paper and
provide some ideas for future work in cross language image retrieval.

2 Building a Test Collection for Multilingual Image Retrieval 

Evaluation of retrieval systems is either system-focused, e.g. comparative perform- 
ance between systems or user-centered, e.g. a task-based user study. For many years 
IR evaluation has been dominated by comparative evaluation of systems in a com- 
petitive environment. The design of a standardised resource for IR evaluation was 
first proposed over 30 years ago by Cleverdon [4] and has since been used in major 
IR conferences such as TREC [20], CLEF [16] and NTCIR [3]. Over the years the 
creation of a standard test environment has proven invaluable for the design and 
evaluation of practical retrieval systems both within and outside a competitive envi- 
ronment. The main components of a TREC-style test collection are: (1) document 
collection, (2) topics, and (3) relevance assessments. 

In TREC, NTCIR and CLEF, participants are given test collection data and topics and
asked to submit their entries. A subset, chosen by the organisers, is used to create
document pools, one for each topic. Domain experts (assessors) are then asked to judge
which documents in the pool are relevant or not. Document pools are created because in
large collections it is infeasible to judge every single document for relevance. These
assessments are then used to assess the performance of submitted systems. User-centred
evaluation is important to assess the overall success of a retrieval system, as it takes
into account factors other than just system performance, e.g. the design of the user
interface and system speed (Dunlop argues this in [7]). A number of researchers have
highlighted the advantages of user-centred evaluation, particularly in image retrieval
systems (see, e.g. [10], [14] and [7]). One of the main aims of ImageCLEF is to provide
both the CLIR and image retrieval communities with a number of useful resources
(datasets and relevance assessments) to facilitate and promote further research in
multilingual image retrieval.

3 These are called Content-Based Information Retrieval (CBIR) systems.

Calls for a TREC-style evaluation of image retrieval systems have been made
[10][15][19], although Forsyth [9] argues that the evaluation of CBIR systems is
currently useless because the systems perform too poorly (hence the interest in
combining both textual and visual features). We are unaware of existing test collections
for CL image retrieval, although evaluation resources do exist to evaluate specific
image retrieval tasks, e.g. journalism [13], and CBIR systems, e.g. Benchathlon⁴. One of
the largest obstacles in creating a test collection for public use is securing a
suitable collection of images for which copyright permission is agreed. This has been a
major factor influencing the datasets used in the ImageCLEF campaigns. The ImageCLEF
test collection provides a unique contribution to publicly available test collections
and complements existing evaluation resources.



2.1 The Existing ImageCLEF Test Collection 

Because CL image retrieval encompasses at least two research areas: (1) image re- 
trieval and (2) CLIR, building a comprehensive and suitable test collection is a tall 
order. Therefore, in 2003 we organised a pilot experiment at CLEF with the following 
aim: given a multilingual statement describing a user need, find as many relevant im- 
ages as possible. More formally the task was a bilingual ad hoc retrieval task in which 
a static collection was searched using previously unseen topics. 

The retrieval task was designed to simulate the situation in which a user expresses 
their need in a language different from the collection, requiring a visual document to 
fulfil their search request (e.g. searching an on-line art gallery or stock photographic 
collection). For this retrieval task query translation is the preferred method of bridging 
the language gap as translating the collection would be both time and resource expen- 
sive and less likely in practice. Participants were not constrained in their use of re- 
trieval method, enabling either text or content-based searches (or a combination of 
both). As a retrieval task there are several challenges other than translation which in- 
clude: (1) captions typically short in length, (2) images of varying content and quality, 
(3) bridging the gap between colloquial and domain-specific language used in the 
captions and cross language queries, and (4) queries short in length thereby providing 
little context for translation. 

The dataset used consisted of 28,133 historic photographs from the library at St
Andrews University [17]. All images are accompanied by a caption consisting of 8
distinct fields which can be used individually or collectively to facilitate image
retrieval (see Fig. 1). The 28,133 captions consist of 44,085 terms and 1,348,474 word
occurrences; the maximum caption length is 316 words, and the average is 48 words. All
captions are written in British English and contain colloquial expressions and
historical terms. Approximately 81% of captions contain text in all fields; the rest
generally lack the description field. In most cases the image description is a
grammatical sentence of around 15 words. The majority of images (82%) are black and
white, although colour images are also present.



4 http://www.benchathlon.net/ 

Record ID:     JV-A.000460
Short title:   The Fountain, Alexandria.
Long title:    Alexandria. The Fountain.
Location:      Dunbartonshire, Scotland
Description:   Street junction with large ornate fountain with columns, surrounded by
               rails and lamp posts at corners; houses and shops.
Date:          Registered 17 July 1934
Photographer:  J Valentine & Co
Categories:    [ columns unclassified ][ street lamps - ornate ][ electric street
               lighting ][ shepherds & shepherdesses ][ streetscapes ][ shops ]
Notes:         JV-A460 jf/mb




Fig. 1. An example image and caption (see: http://www-library.st-andrews.ac.uk).

We generated fifty representative search requests in English (called topics) and
translated them into 6 different languages: Dutch, Spanish, German, French, Italian and
Chinese (provided by the National Taiwan University or NTU). In TREC, CLEF and NTCIR
final topics are chosen from a pool of suggestions generated by searchers familiar with
the domain of the document collection. Frequently searched subject areas in the St
Andrews collection were identified by analysing log files generated from accesses to a
web search engine used by the library. Based on these subject areas we created queries
that would test the capabilities of both a translation and an image retrieval system,
e.g. pictures of specific objects versus pictures containing actions, broad versus
narrow concepts, topics containing proper names, compound words, abbreviations,
morphological variants and idioms. Each topic consisted of a short title, a longer
narrative describing the search request and an exemplar relevant image. For ImageCLEF
2003 only topic titles were translated, due to the limited resources available to us.



2.2 Relevance Assessments and Evaluation 

What turns a set of documents and queries into a test collection are the relevance
judgments: manual assessments of which documents are relevant or not for each topic.
Judging whether an image is relevant or not is highly subjective (e.g. due to knowledge
of the topics or domain, different interpretations of the same document, and searching
experience); to reduce the effect of this subjectivity, two assessors judged each topic.

We adopted the pooling method as used in TREC, CLEF and NTCIR where a set 
of candidate documents is created (called the pool) by merging together the results of
the top n documents from the ranked lists provided by participants. This assumes that 
highly ranked documents from each entry will contain relevant documents. Ideally, 
ranked lists should come from a diverse range of systems to ensure maximal cover- 
age. We also supplemented the pooling method with manual interactive searches (also 
known as interactive search and judge or ISJ) to ensure good quality pools (as used in 
NTCIR). We found assessors were able to judge the relevance of images very quickly 
(especially eliminating non-relevant ones) enabling all ImageCLEF submissions to be 
used in creation of the pools (compared to a subset of runs for text-based assessment). 
One of the authors familiar with the collection assessed all fifty topics to provide a 
“gold” set of judgments; in addition, ten assessors from the University of Sheffield
judged five topics each to provide a second judgment for each topic using a custom- 
built assessment tool. 
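
The pooling step can be sketched in a few lines. The run names, image identifiers and
the value of n are hypothetical; the sketch only shows the merging of the top n results
per ranked list and leaves out the interactive search and judge additions.

```python
from typing import Dict, List, Set


def build_pool(runs: Dict[str, List[str]], n: int) -> Set[str]:
    """Merge the top-n documents of every ranked list into one pool for a topic."""
    pool: Set[str] = set()
    for ranked in runs.values():
        pool.update(ranked[:n])  # duplicates across runs collapse in the set
    return pool


# Hypothetical ranked lists submitted by two participants for one topic.
runs = {
    "run_a": ["img3", "img1", "img7", "img9"],
    "run_b": ["img1", "img4", "img3", "img2"],
}
pool = build_pool(runs, n=2)
```

Only the images in the pool are shown to the assessors; images found by manual
interactive searches could be added to the same set before judging.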

Images were judged relevant if any part of the image was deemed relevant. Primary 
judgment was made on the image, but assessors also consulted the image captions. 
Assessors were asked to judge the relevance of images using a ternary scheme: 
relevant, partially relevant and not relevant to deal with potential uncertainty in the 
assessor's judgment (i.e. it is possible to determine that the image is relevant, but less 
certain whether it exactly fulfils the need described by the topic). Unlike other test 
collections we provided four sets of relevance assessments (called qrels), 
strict/relaxed union/intersection, with which to assess system performance based on 
the overlap of relevant images between assessors and whether the relevance sets in- 
clude images judged as partially relevant or not. These are further described in [5]. 
The strict relevance set can be likened to a high-precision task, whereas the relaxed set 
provides an assessment that promotes higher recall. 
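
One way to read the four qrel sets is sketched below. It assumes (our interpretation, not a definition taken from [5]) that "strict" keeps only images judged relevant, "relaxed" also keeps partially relevant ones, and union/intersection combine the two assessors' sets. The image identifiers are hypothetical.

```python
RELEVANT = "relevant"
PARTIAL = "partially relevant"
NOT_RELEVANT = "not relevant"


def relevant_set(judgments, relaxed):
    """Images one assessor accepts under the strict or relaxed reading."""
    accepted = {RELEVANT, PARTIAL} if relaxed else {RELEVANT}
    return {img for img, label in judgments.items() if label in accepted}


def build_qrels(assessor_1, assessor_2):
    """Return the four relevance sets for a single topic."""
    qrels = {}
    for relaxed in (False, True):
        first = relevant_set(assessor_1, relaxed)
        second = relevant_set(assessor_2, relaxed)
        name = "relaxed" if relaxed else "strict"
        qrels[name + "-union"] = first | second
        qrels[name + "-intersection"] = first & second
    return qrels


# Hypothetical judgments from two assessors for one topic.
a1 = {"img1": RELEVANT, "img2": PARTIAL, "img3": NOT_RELEVANT, "img4": RELEVANT}
a2 = {"img1": RELEVANT, "img2": RELEVANT, "img3": PARTIAL, "img4": NOT_RELEVANT}
qrels = build_qrels(a1, a2)
```

Under this reading, strict-intersection is the smallest set and relaxed-union the largest.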



2.3 Results and Lessons Learned 

Four groups entered ImageCLEF 2003: Sheffield University, NTU, University of Sur- 
rey and Daedalus, a Spanish R&D organization. All participants used text-based re- 
trieval methods with no content-based image analysis. Results from ImageCLEF have 
shown that in general CL image retrieval using query translation can achieve rela- 
tively high performance for the suggested bilingual search task. However, we found 
retrieval performance to vary dramatically across both language and topic. The high- 
est result was obtained for French (78% of monolingual); the worst for Chinese (51% 
of monolingual), indicating that there is still room for improvement. In particular, 
enhancements are required to deal with poor retrieval caused by translation errors. Results 
from ImageCLEF also showed that transliteration of proper names was beneficial for Chinese 
retrieval, and that thesaurus-based query expansion improved performance for other 
languages. ImageCLEF was effective at attracting new research groups to CLEF and 
this year is advertised as an entry-level CLIR task. 

Based on our experiences from last year we have made the following changes to 
the ImageCLEF track: (1) to offer greater diversity we have added a medical retrieval 
task, (2) to promote ImageCLEF as an entry-level CLIR task we are offering topics in 
12 languages rather than 6, (3) to encourage participants to exploit visual features we 
have set up public access to a default CBIR system, (4) due to ambiguity in relevance 
assessments we have selected more specific topics, including queries refined by 
photographer, location and date (general queries such as “mountain scenery” retrieved too 
many images and were too laborious to assess), and (5) we are using relevance 
assessors familiar with the collection (this includes native English speakers who are 
familiar with colloquial English/Scottish terms, e.g. “perambulator”). 



3 The ImageCLEF 2004 Track 

3.1 The Bilingual Ad Hoc Retrieval Task 



A bilingual ad hoc task similar to that run in 2003 is being offered to participants to 
enable further experiments on the St. Andrews dataset and to determine whether 
improvements can be made on last year’s results. Experiments will compare: (1) different 
methods of query translation (e.g. dictionary lookup versus MT), (2) query expansion 
(e.g. global versus local methods), (3) the use of text-based and CBIR methods, 
either separately or combined, (4) different retrieval models, (5) different 
indexing methods (e.g. indexing all or some fields) and (6) manual vs. automatic 
relevance feedback. 

A new set of 25 topics has been produced in the same manner as before (decide on 
general topics and then refine). However, in addition to using St. Andrews query logs, 
we also used subject areas supplied by staff from St. Andrews’ library. Topic refine- 
ment is based on the query categorisation scheme suggested by Armitage et al. [1] for 
picture archives and designed to test a range of different CL and image search pa- 
rameters. Topics have been translated into the previous languages, plus Japanese, 
Danish, Russian, Finnish, Swedish and Arabic. One unintentional but interesting 
“feature” of translated topics in ImageCLEF 2003 was the introduction of translation 
errors, e.g. spelling mistakes and erroneous diacritics, resulting in low retrieval 
performance for some topics. These problems are not addressed by existing CLEF tasks. 
We will provide two sets of topics: one set will contain spelling errors; the other will 
be checked and free of such errors. 



3.2 The Medical Image Retrieval Task 

To offer participants a different domain/scenario and encourage the use of CBIR 
systems, we have introduced a task based on medical retrieval. In the ad hoc task it is the 
query which is multilingual; in the medical retrieval task the document collection is 
multilingual, presenting different CLIR challenges. 




Fig. 2. Example images from the CasImage dataset (http://www.casimage.com/) 



In general, medical practitioners are unsatisfied with retrieving images by text and 
the implicit knowledge stored in the images plus attached text is rarely used. As a di- 
agnostic aid, being able to search a database of images with a new example would en- 
able them to obtain more evidence. The goal of this task is to investigate the use of 
CBIR and text-based retrieval systems for this kind of medical retrieval task. The task 
is being run by University Hospitals of Geneva who are supplying the medical data, 
topics and relevance judgments. The medical task is this: given an example image, 
find similar images which will be helpful in confirming the initial diagnosis. Because 
the initial retrieval has to be visual, we expect the case notes to be useful in finding 
additional similar images complementary to CBIR. We also aim to evaluate whether 
relevance feedback can improve performance, to compare relevance feedback using 
images, text or both, and to determine whether images alone can be used for pseudo 
relevance feedback. 

The dataset (CasImage) consists of 8,751 anonymised medical images, e.g. scans 
and x-rays (see Fig. 2). The majority of images are associated with case notes, a 
written description of a previous diagnosis for an illness the image identifies. Case 
notes consist of several fields including a diagnosis, a description, clinical 
presentation, keywords and title. The task is multilingual because the case notes are 
written in either English or French. Not all case notes have entries for each 
field, and the text itself reflects real clinical data in that it contains mixed-case text, 
spelling errors, erroneous French accents and ungrammatical sentences. In the 
dataset there are 2,078 cases that can be exploited during retrieval (e.g. for query expansion). 

Currently 25 example images (topics) have been chosen as representative of the 
dataset. A set of ground truths for each topic has already been identified by domain 
experts based on the CBIR system developed by the third author⁵, and these will form 
part of the document pools created from participants’ entries. Pools will be formed in 
a manner similar to the ad hoc task and medical practitioners will help judge the 
relevance of the pools after final submissions. In this task images are judged using a 
binary relevant or not relevant judgement, and the assessments will be used to evaluate 
participants’ entries. This retrieval task offers a number of challenges including: (1) 
combining text and content-based methods of retrieval after an initial visual search, 
(2) dealing with domain-specific medical terminology, (3) case notes of varying 
quality in more than one language (i.e. a mixed language index), and (4) the high cost of 
returning non-relevant images (i.e. mis-diagnosis), which is inevitable when 
using visual-only search methods. 



3.3 The Interactive Retrieval Task 

Campaigns such as iCLEF⁶ have shown the value of user-centred evaluation for CLIR, 
and CL image retrieval would seem to be a rich source for user-centred experiments. 
Past research has shown that the search activities of a user in an image retrieval sys- 
tem vary between searching for specific images and browsing the image collection 
(see, e.g. [10] and [6]). For a CL image retrieval system, the issue is how best the 
system can support the user’s search in locating relevant images as quickly, easily and 
accurately as possible. User-centred evaluation in a variety of contexts and domains 
will help us determine how CL image retrieval systems can best help users to: (1) 
formulate their queries (e.g. whether text or visual queries alone are best or can be 
used in combination), (2) refine the search request (query reformulation will depend 
on the outcome of the system and could involve refinements using textual and/or 
visual features), (3) browse the collection, and (4) identify relevant images (e.g. what 
additional information would help the user judge the relevance of an image and how 
this is best displayed). 

Cox et al. [6] suggest three classes of image search: (1) target or known-item 
search (i.e. find a specific image), (2) category search (e.g. “find pictures of the Eiffel 
Tower”) and (3) open-ended browsing (i.e. wandering through the collection). They 
argue that the target search encompasses the other categories of search; it is simple for 
the user to perform and has clear measures of effectiveness. The goal for the user in 
such a task is, given an image, to find it again in the collection. Unlike being given a 
textual topic description, the user must interpret the given image and generate suitable 
query terms in a given language (different from that of the document collection). The 
scenario models the situation in which a user searches with a specific image in mind 
(perhaps they have seen it before) but without knowing key information, thereby 
requiring them to describe the image instead, e.g. searching for a familiar painting whose 
title and painter are unknown. This task will use the St. Andrews dataset and our 
experimental setup will follow the guidelines for user-centred experiments as suggested 
by iCLEF. This task will be undertaken in collaboration with the iCLEF organisers to 
ensure consistency in CLEF methodologies. Participants are asked to follow the 
experimental setup but can perform whatever experiments they like. 

5 See http://viper.unige.ch/ for a list of publications about the VIPER CBIR system. 

6 See http://terral.lsi.uned.es/iCLEF/ for information about iCLEF. 

A minimum of 8 users and 8 topics are required for this task. Users are given 10/15 
minutes to find each image using only CL queries. Topics are general enough that 
people unfamiliar with the collection can still perform the searches. Captions must 
also be translated into the user's query language before being displayed (if at all) to the 
user. The aim of this experiment will be to observe users' search habits and to determine 
what kind of interface best supports query refinement. For example, the user is shown a 
picture of an arched bridge but starts with the query “bridge”. By finding similar 
images and maybe using keywords from their captions, the user refines the query until 
the relevant image is found. Topics and systems will be presented to the user 
in combinations following a Latin-square design to ensure user/topic and 
system/topic interactions are minimised. Qualitative performance measures are captured 
using questionnaires provided by us, and quantitative measures include: whether the 
given image is found or not, the time taken to find the image, the number of images 
viewed before finding the image and the number of user interactions required. 
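
To illustrate the Latin-square idea, the sketch below builds a simple cyclic square and uses one row per user as that user's topic order. It is an assumption-laden illustration: the actual iCLEF design also balances systems, which is left out here, and the user and topic names are hypothetical.

```python
def cyclic_latin_square(n):
    """Return an n x n Latin square: row i is 0..n-1 rotated left by i."""
    return [[(row + col) % n for col in range(n)] for row in range(n)]


def topic_orders(users, topics):
    """Give each user one row of the square as a topic presentation order."""
    if len(users) != len(topics):
        raise ValueError("this sketch needs as many users as topics")
    square = cyclic_latin_square(len(topics))
    return {user: [topics[i] for i in row] for user, row in zip(users, square)}


users = ["user%d" % (i + 1) for i in range(8)]
topics = ["topic%d" % (i + 1) for i in range(8)]
orders = topic_orders(users, topics)
for user in users:
    print(user, orders[user])
```

Every user sees every topic once, and every topic appears once in each presentation position, so order effects are spread evenly across topics.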

4 Conclusions and Future Work 

In this paper we have discussed our proposal for three cross-language image retrieval 
tasks as part of the ImageCLEF campaign. The tasks vary across domain, scenario, 
where CLIR is used, whether content-based image retrieval is required and whether 
the task is system or user-centered. Results from ImageCLEF 2003 have shown CL 
image retrieval to be a success, but large improvements can still be obtained for some 
languages (e.g. Chinese). Our aim is to promote CL image retrieval and provide a 
standardised set of resources in the form of test collections (i.e. a collection, topics 
and relevance assessments) which can be used in further CL image retrieval 
experiments. In future work we plan to expand the collections and tasks offered in 
ImageCLEF. In particular we would like to offer collections with non-English captions, 
provide a Web-based image retrieval task and offer further image retrieval tasks, e.g. 
aspectual retrieval. 



Abstract. Our work in content-based image retrieval (CBIR) relies on 
content-analysis of multiple representations of an image which we term 
multiple viewpoints or channels. The conceptual idea is to place each 
image in multiple feature spaces and then perform retrieval by query- 
ing each of these spaces and merging the several responses. We have 
shown that a simple realization of this strategy can be used to boost the 
retrieval effectiveness of conventional CBIR. In this work we evaluate 
our framework in a larger, more demanding test environment and find 
that while absolute retrieval effectiveness is reduced, substantial relative 
improvement can be consistently attained. 



1 Introduction 

Content-based image retrieval (CBIR) has been the object of considerable study 
since the early 90’s. Much effort has gone into characterizing the “content” of 
an image by means of a variety of features for the purpose of indexing and sub- 
sequent retrieval. In earlier work [1] we proposed a strategy to capitalize on this 
work and to extend it by employing content-analysis of multiple representations 
of an image which we term multiple viewpoints [2]. The idea is to place each im- 
age in multiple feature spaces and then effect retrieval by querying each of these 
spaces and merging the several responses. The impetus for this research comes 
from work in text IR on combination of evidence strategies that dates back to 
the early 90’s. Two approaches have generally been used. In the first approach 
a diversity of queries is used to capture an information need more precisely. The 
several queries can be combined before searching, or issued individually and the 
results of each query merged afterwards. The work of Belkin et al. [3,4] adopts 
this approach. 

The second strategy is to use a diversity of representations, that is, create 
several indexes over the same corpus of documents. The typical strategy is to 
index the corpus with the same technology varying indexing parameters, or to 
index the corpus with different technologies. Queries are processed in each setting 
with the results being merged afterwards. The work of Fox and Shaw [5] 
adopts this strategy. Bartell et al. [6] also look at combining evidence in this 
framework. The approach we adopt for extending CBIR systems to combine 
multiple evidence is analogous to this latter approach. 

* This material is based upon work done while serving at the National Science 
Foundation. Any opinion, findings, and conclusions or recommendations expressed in 
this material are those of the authors and do not necessarily reflect the views of the 
National Science Foundation. 

P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 252-260, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 

An Empirical Investigation of the Scalability 

These ideas are embodied in our synthetic retrieval model for CBIR [7], shown 
schematically in Figure 1. We refer to this as a synthetic retrieval model because 
we merge the various viewpoints and synthesize a channel for presentation to the 
user. We increase the number of viewpoints in CBIR systems in three different 
ways: multiple representations; multiple CBIR systems; and multiple queries. We 
also employ relevance feedback to further increase retrieval performance. Within 
this framework we have investigated the use of a diversity of representations 
that we call channels to achieve retrieval effectiveness gains over conventional 
CBIR [1,8,9,7]. Our approach is exogenous; we treat the CBIR system as a black 
box. We create additional channels by transforming the images and indexing the 
transformed images. Our four channels derive from the color positive (C+) and 
negative (C-) and the black and white positive (B+) and negative images (B-). 
Note that the C+ channel corresponds to the conventional CBIR system. 
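
The four channels can be pictured with the following sketch, which derives the C-, B+ and B- representations from a color image held as rows of (r, g, b) tuples. The grayscale weights are the common luma coefficients and are our assumption; the text does not say how the black and white images were produced.

```python
def to_gray(pixel):
    """Map an (r, g, b) pixel to gray using assumed luma weights."""
    r, g, b = pixel
    level = round(0.299 * r + 0.587 * g + 0.114 * b)
    return (level, level, level)


def negative(pixel):
    """Invert every component of an 8-bit pixel."""
    return tuple(255 - value for value in pixel)


def make_channels(image):
    """Return the C+, C-, B+ and B- representations of one image."""
    def apply(fn, img):
        return [[fn(pixel) for pixel in row] for row in img]

    gray = apply(to_gray, image)
    return {
        "C+": image,                    # original color image
        "C-": apply(negative, image),   # color negative
        "B+": gray,                     # black and white positive
        "B-": apply(negative, gray),    # black and white negative
    }


# A tiny hypothetical 1 x 2 image: one red pixel and one black pixel.
channels = make_channels([[(255, 0, 0), (0, 0, 0)]])
```

Each transformed image would then be handed to the unchanged CBIR indexer, which is what makes the approach a black-box one.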

In this paper we evaluate our framework in a larger, more demanding test 
environment. Due to page limitations we refer the reader to [1,8,9,7] for specific 
details of our framework. In the remainder of this paper we describe the current 
experimental setup and finally discuss our results. 

Fig. 1. Synthetic Retrieval Model 



2 Experimental Setup 

2.1 Basic CBIR Technology 

We used a basic CBIR setup similar to that of the MiAlbum system used 
in the work of Liu et al. [10]. Our system uses seven image features: three color 
features and four texture features. For similarity comparisons each feature was 
compared separately and then combined with equal weight. 

2.2 Testbed 

Test Data. We used two different image collections in this work. 

1. D34: this is 3,400 images drawn from 34 categories of the COREL image 
collection. Each category contains 100 images. The categories were chosen 
because each of the images has a salient foreground object. 

2. D594: this is a larger version of the COREL database consisting of 594 
image categories, each having 100 images. Thus, D594 contains 59,400 
images. It should be noted that D34 is a proper subset of D594. 

Query Sets. We use three different query sets in this work. 

1. Q3400: Each of the images in D34 is used as a query. Thus Q3400 = D34 
and there are 3,400 query images. 

2. Q204: The 34 categories of D34 are uniformly sampled and 6 images are 
included in Q204 from each category. This is a 6% sample of D34 with 
equal representation of each category. Thus, there are 204 query images in 
Q204. 

3. Q3564: The 594 categories of D594 are sampled in the same way as Q204. 
Thus, this is a 6% sample of D594 with 3,564 queries and equal representa- 
tion of all categories in the sample. Note that Q204 is not a proper subset 
of Q3564 and has no specific relation to it. 
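
The sampled query sets can be reproduced in outline as below. This is a sketch with hypothetical category and image names and an arbitrary seed; it only shows that drawing 6 images from each 100-image category yields the stated set sizes.

```python
import random


def sample_queries(categories, per_category=6, seed=0):
    """Draw the same number of query images, uniformly, from every category."""
    rng = random.Random(seed)
    queries = []
    for name in sorted(categories):
        queries.extend(rng.sample(categories[name], per_category))
    return queries


def make_collection(num_categories, images_per_category=100):
    """Build a stand-in collection of hypothetical image identifiers."""
    return {
        "cat%03d" % c: ["cat%03d_img%03d" % (c, i) for i in range(images_per_category)]
        for c in range(num_categories)
    }


d34 = make_collection(34)
d594 = make_collection(594)
q204 = sample_queries(d34)
q3564 = sample_queries(d594)
print(len(q204), len(q3564))
```

Because the two samples are drawn independently, nothing forces the Q204 images to reappear in Q3564, which matches the remark that Q204 is not a subset of Q3564.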

Ground Truth. In earlier work we used the “foreground” groundtruth for 
D34 [1,8,9], but since we do not have the equivalent for the additional image 
categories in D594, we have used a different but consistent “COREL” groundtruth 
in the work reported here. This latter groundtruth is defined to mean that all 
the images in an image category are relevant to all the other images in the 
category and not relevant to any other images. Thus, any image selected from a test 
collection to act as a query will have exactly 99 relevant images in the collection. 

Our earlier work has shown remarkable consistency between the performance 
as measured by these two groundtruths and we have never had one contradict 
the other in an experiment so we believe this choice to be adequate for our 
purposes. 

Indexing the Images. We created four indexes, one for each channel 
in our testbed. The images were transformed into the representation of the 
channel and then indexed by our CBIR system. Thus, for each testbed we have 
a single corpus of images over which we have four separate indexes. 

Experiment Notation. We denote a particular experiment by Q/D where 
Q is the query set and D is the testbed data set. For example, Q204/D594 
denotes the 204 queries of the 6% sample of D34 processed against the 59,400 
images in the large data set. 

User Model for Relevance Feedback. In an earlier study [7] we observed a 
significant improvement in retrieval performance when using relevance feedback, 
that is, providing images identified as relevant in one iteration of the search 
to the query set in the next iteration. This is consistent with the results of 
other studies of relevance feedback. Our approach is to issue each feedback query 
independently and then merge the results for presentation to the user. This leaves 
open the issue of how the feedback queries are chosen. We use two strategies: 

1. Identify the top k images (Top-k); and 

2. Take k images at random from among the relevant images (Random-k). 

The former strategy is customarily used in text IR experiments. However, the 
latter strategy seems more appropriate for CBIR given the relative ease with 
which a user may judge the relevance of images. We feel that the Random-k user 
model more accurately reflects user behavior. Earlier work [7] has shown that 
this strategy results in higher retrieval performance because it defeats 
self-similarity in feedback images and therefore achieves a greater visual diversity 
among the feedback images. Note that there are at most k images chosen by 
either strategy because in some cases fewer than k images are present in the 
retrieval result. Further, k = 8 in all the experiments reported here. 
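
The two user models can be sketched as follows. We assume here that Top-k means the k highest-ranked relevant images in the result list, which is our reading of the text; image identifiers are hypothetical.

```python
import random


def top_k(ranked, relevant, k=8):
    """Top-k model: the k highest-ranked relevant images in the result."""
    return [img for img in ranked if img in relevant][:k]


def random_k(ranked, relevant, k=8, rng=None):
    """Random-k model: up to k relevant images drawn at random from the result."""
    rng = rng or random.Random()
    candidates = [img for img in ranked if img in relevant]
    return rng.sample(candidates, min(k, len(candidates)))


# Hypothetical result list of 100 images in which every third one is relevant.
ranked = ["img%03d" % i for i in range(100)]
relevant = {img for i, img in enumerate(ranked) if i % 3 == 0}

feedback_top = top_k(ranked, relevant)
feedback_random = random_k(ranked, relevant, rng=random.Random(1))
```

Both functions return fewer than k images when the result contains fewer than k relevant ones, mirroring the "at most k" remark above.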

2.3 Methodology 

The query processing is the same in all experiments. One query set is processed 
against one testbed. Each query is processed separately and the precision¹ at 100 
images seen (P100) is calculated for each. The average P100 is calculated over 
the entire query set and that result is reported. We note that P20 has become 
a very common metric for reporting results in text-based IR. A typical CBIR 
UI displays 30-50 thumbnails at a time in response to a query. Because of the 
ease of evaluating images for relevance relative to text documents, we feel that 
P100 is a more appropriate performance measure. One hundred images is also 
the smallest result size at which we could conceivably achieve recall² of 1.0 for 
any of the queries in our query sets. 
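
The measures used here can be written down directly from their definitions. The sketch below is illustrative only; the result list and ground truth are hypothetical, with 99 relevant images per query as in the COREL groundtruth.

```python
def precision_at(ranked, relevant, cutoff=100):
    """Relevant images retrieved divided by images retrieved, within the cutoff."""
    retrieved = ranked[:cutoff]
    if not retrieved:
        return 0.0
    hits = sum(1 for img in retrieved if img in relevant)
    return hits / len(retrieved)


def recall_at(ranked, relevant, cutoff=100):
    """Relevant images retrieved within the cutoff divided by all relevant images."""
    if not relevant:
        return 0.0
    hits = sum(1 for img in ranked[:cutoff] if img in relevant)
    return hits / len(relevant)


def mean_p100(results, ground_truth):
    """Average P100 over a query set; both arguments are keyed by query id."""
    scores = [precision_at(results[q], ground_truth[q]) for q in results]
    return sum(scores) / len(scores)


# One hypothetical query: 25 of the 100 images seen are relevant.
relevant = {"same_%d" % i for i in range(99)}
ranked = ["same_%d" % i for i in range(25)] + ["other_%d" % i for i in range(75)]
p100 = precision_at(ranked, relevant)
r100 = recall_at(ranked, relevant)
```

With 99 relevant images per query, a cutoff of 100 is the first round number at which recall of 1.0 is reachable, and P100 can never exceed 0.99.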

Our merging results in [1,8,9] were produced using the combSUM [5,11] 
approach, that is, we summed the similarity values for images across the channels 
in which the image was included in the response set. (The conditions set out by 
Vogt [12] for linearly combining relevance scores apply here: our channels do have 
reasonable performance and they do not rank relevant documents similarly.) We 
have also used a rank sum approach, midrank merge³, for merging and found 
that to perform comparably with combSUM. We use that technique here. 
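
Both merging schemes are easy to sketch. The code below is an illustration under stated assumptions: an image missing from a channel receives rank 101 (the text only says "higher than 100"), ties are broken by image name, and the special handling of tied ranks implied by the name "midrank" is not modelled. Scores and image names are hypothetical.

```python
def comb_sum(channel_scores):
    """combSUM: add similarity scores over the channels that returned the image."""
    totals = {}
    for scores in channel_scores:
        for img, score in scores.items():
            totals[img] = totals.get(img, 0.0) + score
    return sorted(totals, key=lambda img: (-totals[img], img))


def rank_sum_merge(channel_rankings, missing_rank=101):
    """Rank-sum merge: add each image's rank in every channel, lowest sum first."""
    images = set().union(*channel_rankings)

    def total_rank(img):
        return sum(
            ranking.index(img) + 1 if img in ranking else missing_rank
            for ranking in channel_rankings
        )

    return sorted(images, key=lambda img: (total_rank(img), img))


by_score = comb_sum([{"a": 0.9, "b": 0.5}, {"b": 0.6, "c": 0.2}])
by_rank = rank_sum_merge([["a", "b", "c"], ["b", "c", "d"]])
```

In the rank-based example, image "a" is ranked first by one channel but drops to third overall because the other channel did not retrieve it at all.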

3 Results 

The four plots shown in Figure 2(a-d) each show five experiments: Q204/D34, 
Q3400/D34, Q204/D594, Q3400/D594 and Q3564/D594. The 
plots are paired vertically by user model (Top-k, Random-k) and are paired 
horizontally by channel configuration (one channel, four channels). The Random-k 
user model is higher performing than the Top-k model. We have observed 
this in earlier experiments [7] and attribute it to greater visual diversity in the 
feedback images. The topmost pair of lines shows that the small sampled query 
set (Q204) has very similar performance to the larger query set (Q3400) in the 
smaller testbed (D34). The next two lines show that the small sampled query 
set also has very similar performance to the larger query set in the larger testbed 
(D594). This is consistent in all four plots. 

1 Precision is the ratio of the number of relevant images retrieved to the total number 
of images retrieved. 

2 Recall is the ratio of the number of relevant images retrieved to the total number of 
relevant images. 

3 Each channel assigns a rank to each image retrieved and not retrieved. The assigned 
ranks are summed to determine the image’s rank in the final result. When a channel 
does not retrieve an image, it is assigned a rank higher than 100. 

Conclusion 1: The smaller sampled query set, Q204, is representative of the 
larger query set, Q3400, as regards performance evaluation. 

Fig. 2. Retrieval performance as measured by sample vs. all queries in small and large 
testbeds. 

Table 1. Retrieval precision and performance increase after each feedback iteration 
(Top-k user model). Avg. total increase in the small (large) testbed is 29.3% (61.8%). 

             Q204                          Q3400                         Q3564
Iteration    D34            D594           D34            D594           D594
0            .2109          .0759          .2032          .0730          .0413
1            .2362  12.0%   .1003  32.1%   .2372  16.7%   .0966  32.3%   .0554  34.1%
2            .2528   7.0%   .1126  12.3%   .2563   8.1%   .1102  14.1%   .0632  14.1%
3            .2652   4.9%   .1209   7.4%   .2698   5.3%   .1190   8.0%   .0673   6.5%
Total               25.7%          59.3%          32.8%          63.0%          63.0%

The 4-channel configurations (Figure 2c, d) are equivalent to the single chan- 
nel configurations (Figure 2a, b) initially but outperform them considerably in 
all feedback iterations. 

Conclusion 2: Relevance feedback in the multichannel configuration is more ef- 
fective than in the single channel configuration. 

The four plots of Figure 2 clearly show that absolute retrieval effectiveness 
(as measured by P100) is lower in the larger database (D594) as compared with 
the effectiveness observed in the smaller (D34). This occurs in both single and 
multichannel configurations. 

Conclusion 3: CBIR retrieval precision is substantially reduced when the size of 
the database is increased. 

However, even though the absolute effectiveness is reduced in the larger 
testbed, the rate of improvement with each feedback iteration is roughly con- 
stant. In addition the overall improvement in each configuration was also very 
stable, averaging 62%. Table 1 shows the actual values. This is perhaps the most 
important feature of the multichannel approach. 
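
The percentages in Table 1 follow from the P100 values by simple relative change. The short sketch below recomputes the totals and the per-testbed averages from the precision figures reported in the table; only the function and variable names are ours.

```python
# P100 after feedback iterations 0..3, copied from Table 1.
P100 = {
    "Q204/D34": [0.2109, 0.2362, 0.2528, 0.2652],
    "Q204/D594": [0.0759, 0.1003, 0.1126, 0.1209],
    "Q3400/D34": [0.2032, 0.2372, 0.2563, 0.2698],
    "Q3400/D594": [0.0730, 0.0966, 0.1102, 0.1190],
    "Q3564/D594": [0.0413, 0.0554, 0.0632, 0.0673],
}


def relative_increase(before, after):
    """Percentage change from one precision value to a later one."""
    return 100.0 * (after - before) / before


def total_increase(values):
    """Increase from the initial query to the last feedback iteration."""
    return relative_increase(values[0], values[-1])


def average_total(testbed):
    """Mean total increase over the experiments run on one testbed."""
    totals = [total_increase(v) for name, v in P100.items() if name.endswith("/" + testbed)]
    return sum(totals) / len(totals)


for name, values in P100.items():
    print(name, round(total_increase(values), 1))
print("D34", round(average_total("D34"), 1), "D594", round(average_total("D594"), 1))
```

The D594 average rounds to the roughly 62% overall improvement quoted in the text.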

Conclusion 4: The multiple viewpoint techniques demonstrated in the smaller 
testbed (D34) are also effective in the larger testbed (D594). 

Finally, the Q3564/D594 experiment has substantially lower performance 
than the Q3400/D594. Recall that all the queries in Q3400 come from 34 of the 
594 categories in D594 whereas there are 6 queries from each of the categories 
of D594 in Q3564. We hypothesize that Q3400 is therefore an “easier” query 
set than Q3564. 

Conclusion 5: Q3400 and Q3564 do not have similar retrieval performance in 
D594. 

The four plots in Figure 3 are grouped vertically by query set with the smaller 
query set (Q204) on the left and the larger (Q3400) on the right. They are 
grouped horizontally by testbed size with the smaller testbed (D34) topmost 
and the larger (D594) on the bottom. In each case four lines are shown, 
corresponding to the two user models (Top-k and Random-k) and the two channel 
configurations (one, four). Again, the data support Conclusion 1. We are also 
led to the following conclusions. 




J.C. French, X. Jin, and W.N. Martin 

Fig. 3. Performance of single channel vs. multichannel CBIR for two user models. 
[Figure: four panels plotting retrieval performance against feedback iterations, each with 
lines for 4 channels and 1 channel under the random and top user models: (a) query 204 
in 3400, (b) query 3400 in 3400, (c) query 204 in 59400, (d) query 3400 in 59400.] 



Conclusion 6: The Random-k user model consistently achieves the highest pre- 
cision. 

This is good news especially since we consider the Random-k user model to 
more accurately reflect user feedback selection in real CBIR environments. 

Conclusion 7: The multichannel configuration is superior to the conventional 
single channel approach achieving greater precision with each feedback iteration. 

Especially noteworthy is the fact that this improvement is attainable in the 
larger image testbed: 59% (63%) for the smaller (larger) query set. We also 
observed this in the more difficult large test environment (Q3564/D594) 
discussed earlier, where the query set is a 6% sample of the 59,400 images queried 
(see Table 1). The improvement (63%) was exactly the same as that achieved 
with the Q3400 query set (see Figure 3c, d). 

The results clearly indicate that even in the more difficult testbed, we can, in 
fact, combine the channels, even naively, to realize retrieval effectiveness gains 
over the conventional single-channel CBIR approach. 

4 Conclusions 

We have described a simple approach for improving the retrieval effectiveness 
of conventional CBIR systems. Our approach treats the CBIR technology as a 
black box which can be used to provide different channels of retrieval results for 
subsequent merging or for use in interactive retrieval interfaces. The channels 
are implemented as additional indexes over simple image transforms. This offers 
a simple, cost-effective strategy for boosting the performance of CBIR systems. 

In [1] we showed that multichannel retrieval could increase CBIR retrieval 
effectiveness. We demonstrated an 8% increase in non-interpolated average preci- 
sion with a 4-channel configuration of our CBIR system over the baseline system 
when ranking all the images of our test database. The average non-interpolated 
precision increases by 22% in the 4-channel system when we consider result lists 
of the top 100 images. 

In [8] we looked at the potential for performance improvement when two 
CBIR systems were used to supply the viewpoints for constructing the synthetic 
channel. Again, the combination of multiple channels (this time from different 
CBIR systems) resulted in increased retrieval effectiveness. Moreover, the 
combination of the two techniques, multiple systems and multiple representations, 
was complementary and resulted in an even greater performance boost. 

In [7] we looked at multiple queries as a means of achieving greater retrieval 
performance by providing more exemplars of the user’s information need. We 
introduced the concept of visual diversity and examined the role of multiple 
representations and multiple CBIR systems in achieving visual diversity to im- 
prove retrieval with multiple queries. We also examined several strategies for 
accommodating relevance feedback in our synthetic framework. A new feedback 
evaluation strategy was also proposed and shown to be more effective because 
it increases the visual diversity of the feedback images. 

In this paper we extended our work to validate our approach in a larger, more 
demanding retrieval environment. This kind of study is hampered somewhat by 
the lack of suitable testbeds with associated groundtruth. Our work here is 
extended to a testbed of 59,400 images from our earlier testbed of 3,400 images. 

This study confirmed our earlier finding that the multiple viewpoint tech- 
niques, singly and in combination, improve retrieval effectiveness of CBIR sys- 
tems even in more demanding retrieval environments. We found that although 
the absolute precision was reduced, the rate of improvement held up well in 
feedback iterations and the overall relative improvement after three feedback 
iterations was approximately 62%. 

Another finding is that we can get extremely accurate performance evaluation 
with smaller query samples. This will enable us to conduct empirical studies more 
efficiently with confidence in the results. 

Note that the techniques proposed here do not increase the user work in 
relevance feedback. The user is only concerned with the synthetic channel presented 
after each feedback cycle. The system transparently feeds the selected images 
back to all the underlying channels and merges the several results back into a 
synthetic channel for the user. 

The synthetic retrieval framework also makes parallelization a simple matter. 
Retrieval on each channel is independent of all others. Thus, each channel can 
be assigned to a separate processor for query processing followed by a merging 
stage. This strategy can provide high retrieval efficiency in addition to improved 
retrieval effectiveness. 
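
The parallel strategy described above can be sketched as follows. This is a minimal illustration, not the system's implementation: the channel representation (a dictionary of scalar features), the distance function, and the sum-of-distances merging rule are all assumptions made for the example. Threads are used for brevity; assigning each channel to its own processor would call for a process pool instead.

```python
from concurrent.futures import ThreadPoolExecutor


def query_channel(channel, query):
    """Return a distance per image for one channel (lower is more similar).

    Hypothetical channel format: dict mapping image id -> scalar feature.
    """
    return {img: abs(feat - query) for img, feat in channel.items()}


def merge(results):
    """Merge per-channel distances by summing them (assumed merging rule)."""
    merged = {}
    for res in results:
        for img, dist in res.items():
            merged[img] = merged.get(img, 0.0) + dist
    # Best match first.
    return sorted(merged, key=merged.get)


def synthetic_retrieval(channels, query):
    """Query every channel independently, then merge into one ranking."""
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        results = list(pool.map(lambda ch: query_channel(ch, query), channels))
    return merge(results)


channels = [
    {"a": 0.1, "b": 0.9, "c": 0.5},  # e.g. one representation
    {"a": 0.2, "b": 0.8, "c": 0.4},  # e.g. another representation
]
ranking = synthetic_retrieval(channels, 0.0)
```

Because each call to `query_channel` touches only its own channel, the only shared step is the final merge.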

Abstract. In this paper, we introduce a real-time metadata service system
implemented for live digital broadcast TV programs. The system is composed
of three parts: an indexing host, which indexes broadcast programs in real-time;
a broadcaster, where the segmentation metadata delivered from the indexing
host is multiplexed into the broadcast stream and transferred to clients; and a
client PVR, which receives the metadata and locates a segment of interest in the
recorded stream according to the time description of the delivered metadata. We
propose to use broadcasting time as the time description of the segmentation
metadata, which avoids the media localization problems of the broadcast
environment. In addition, we use a spatiotemporal visual pattern of the video
as a verification tool for real-time indexing, which reduces the false
alarms in video segmentation that arise when no efficient tool is available for
verifying video segments. Finally, we report real experiments performed
without a return channel, which demonstrate the feasibility of the
proposed system.



1 Introduction 

Recently, digital set-top boxes (STBs) with local storage, known as personal video
recorders (PVRs), have begun to penetrate TV households. With this new consumer device,
television viewers can record broadcast programs into the local storage of their PVR
for later viewing. Because the video is recorded digitally, viewers can now
directly access a given point of a recorded program in addition to using
the traditional controls such as fast forward and rewind. Furthermore, if
segmentation metadata for a recorded program is available, viewers can browse
the program by selecting predefined video segments within the recorded
program, and can play highlights as well as a summary of the recorded program.

The metadata can be described in proprietary formats or in international open
standard specifications such as MPEG-7 [1] or TV-Anytime [2]. The media locations
used in typical metadata such as the TV-Anytime format are usually described by
either a byte offset, specifying the number of bytes to be skipped from the beginning of
the file, or a media time, specifying a relative time point from the beginning of the file.
However, describing a specific position of a broadcast stream
using media time or byte offset can be ambiguous, since it is hard to identify clearly when or where a
program starts within a broadcast stream in which a number of programs and
commercials are multiplexed and which is streamed continuously through the
broadcast network without program boundary markers.

One possibility for random access to a specific position of a broadcast stream is to
use MPEG-2 DSM-CC Normal Play Time (NPT) [3], which provides a known time
reference to a piece of media. For applications of TV-Anytime metadata in the
DVB-MHP broadcast environment, it was proposed that NPT should be used for
time description [4, 5]. In the proposed implementation, however,
both the indexing system and the client PVRs must handle NPT properly,
which results in highly complex time control.

Another possibility is to use the MPEG-2 Presentation Time Stamp (PTS), which
indicates the time at which a presentation unit is presented in the system target decoder.
However, it requires parsing the packetized elementary stream (PES) layer, and is thus
computationally more expensive. Further, if a broadcast stream is scrambled,
descrambling is needed to access the PTS. Since most digital
broadcast streams are scrambled, an indexing system cannot access such a stream
without an authorized descrambler.

From a practical point of view, we propose to use broadcasting time as the reference
time. It is the simplest and most cost-effective way of describing a time index
within a broadcast stream compared with the above methods, which incur the
implementation complexity of DSM-CC NPT in DVB-MHP or the computational
cost and descrambling problems of PTS. Broadcasting time is carried in the broadcast
stream in the form of the system time table (STT) of ATSC [6] or the time date table (TDT)
of DVB [7]. Using broadcasting time as the reference time does not require the
indexing system and the client PVRs to be connected for synchronization through an
interactive communication channel such as the Internet. It also provides an efficient
way to locate the same position of the broadcast stream on both the indexing
system and the client PVRs, since the STTs or TDTs sit at the temporal positions
of the broadcast stream that correspond to the broadcasting time. For example, the STT of
ATSC is broadcast repeatedly, once every second.
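
As an illustration of how a receiver might turn an STT into a wall-clock broadcasting time, the sketch below assumes the two STT fields defined in ATSC A/65 as we understand them: `system_time` (GPS seconds since 00:00:00 UTC on 6 January 1980) and `GPS_UTC_offset` (leap seconds to subtract). Parsing the table section itself is not shown, and the function name is ours.

```python
from datetime import datetime, timedelta, timezone

# Start of the GPS epoch, the reference point assumed for STT system_time.
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)


def stt_to_utc(system_time, gps_utc_offset):
    """Convert STT fields to a UTC datetime with one-second resolution."""
    return GPS_EPOCH + timedelta(seconds=system_time - gps_utc_offset)


# Example values (made up for illustration): one day after the epoch,
# with a 13-second GPS/UTC offset already included in system_time.
bt = stt_to_utc(86400 + 13, 13)
```

Since the value has one-second resolution, every frame presented within the same second maps to the same broadcasting time.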

Fig. 1 shows the overall structure of the proposed system, composed of an indexing
host (real-time indexing system: RTIS), a broadcaster, and a client PVR.
Segmentation metadata for a live broadcast program is generated at the indexing host
and delivered to the client PVR through the broadcasting network. The details
are given in the following sections: Section 2 describes
the methods used in RTIS for media localization and real-time
segmentation, Section 3 presents the implementation of the test-bed and the
experimental results, and Section 4 concludes the paper.



2 Media Localization and Real-Time Segmentation 

We encounter three problems in implementing the proposed real-time metadata service
scheme. The first is how to localize the broadcast stream with broadcasting time on both
the indexing system and the client PVRs. The second is how to index a live
broadcast program in real-time, that is, how to detect shot boundaries (or scene
changes), group the shots into segments of interest, and easily verify the
detected shot boundaries in real-time. The third is how to deliver the
segmentation metadata to the user's PVR in the broadcast stream.

Fig. 1. Overall structure of test-bed for the real-time metadata service scheme



2.1 Media Localization Using Broadcasting Time 

To solve the media localization problem in broadcast environments, we use the
broadcasting time carried in the STT or TDT of the broadcast stream on both the
indexing system and the client PVRs, because of its convenient features described
in the previous section.

In the indexing system RTIS of Fig. 1, the timestamp mixer is introduced to index
a digital broadcast stream with broadcasting time regardless of whether the stream is
scrambled. The timestamp mixer superimposes a visual timestamp, such as a
structured color-code [8], showing the current broadcasting time onto each frame of the
broadcast stream received through the tuner. The visually time-stamped analog output
signal of the timestamp mixer is then encoded at a low bit-rate at the real-time
indexer. By using the stream encoded at a low bit-rate, we avoid both the problem of
directly accessing a scrambled broadcast stream and the burden of indexing a very
high bit-rate stream such as an HDTV broadcast stream.

In order to superimpose the timestamp for the current broadcasting time, the 
timestamp mixer examines broadcasting time carried on the STT or TDT of the 
received broadcast stream via its tuner. 

In case of ATSC, it is recommended that I-frames shall be sent at least once every 
0.5 second in order to have acceptable channel-change performance. Further, there 
exists a delay between the arrival time of a frame and its presentation time due to the 
VBV delay with maximum delay time of 0.5 second and decoding time delay. Fig. 2 
shows an example of indexing and accessing the start position of a segment specified 
by the broadcasting time BT based on the above properties of ATSC. 

The broadcasting time BT carried in the STT or TDT is represented in discrete
units of one second. Thus the frames presented on screen during a given second are
all time-stamped with the same broadcasting time.

When the real-time indexer indexes the re-encoded video produced by the timestamp
mixer, it extracts the broadcasting time for each video frame from the timestamp
superimposed onto the frame. The extracted broadcasting time is the broadcasting
time at which the frame is presented on screen. For example, in Fig. 2(a), the
broadcasting time of frame A is $BT_n$, because the frame is displayed on screen
while $BT_n$ is being time-stamped. The broadcasting time of frame B, in contrast,
is $BT_{n+1}$ although the frame arrived earlier, during $BT_n$, because frame B is
displayed on screen at $BT_{n+1}$, which is the time with which the indexing system
indexes it.

Let $PTS(a)$ and $PTS(I_n)$ denote the PTS values of the first frame $a$ of a segment
$S_a$ presented at the broadcasting time $BT_n$ and of the first I-frame $I_n$ since
$BT_n$, respectively. Then the time difference $TD(S_a)$ is defined as:

$TD(S_a) = PTS(a) - PTS(I_n)$. (1)

In Fig. 2(a), the time difference $TD(S_A)$ for the segment $S_A$ displayed at $BT_n$
has a positive value because the PVR will display the video starting from $I_n$,
including the segment $S_A$, as shown in Fig. 2(c). However, the time difference
$TD(S_B)$ for the segment $S_B$ displayed at $BT_{n+1}$ has a negative value because
the client PVR will display the video starting from the first I-frame $I_{n+1}$ since
$BT_{n+1}$, which results in missing frame B, the frame that should be presented
first in the segment $S_B$.

Therefore, when we display a segment whose start time is $BT_n$, we propose to
start from $I_{n-1}$, which precedes $I_n$ by one broadcasting time unit
($BT_n - BT_{n-1}$: one second in the case of STT), to avoid missing the first
frame of the segment.
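
The effect of stepping back one broadcasting time unit can be checked with a small sketch. The data layout is hypothetical: each recorded I-frame is represented as a pair of the broadcasting time (in seconds) in effect when it arrived and its PTS, written here in milliseconds for readability. The numbers are invented to mimic frame B of Fig. 2.

```python
def first_iframe_since(iframes, bt):
    """Return the PTS of the first I-frame that arrived at or after bt.

    iframes: list of (arrival broadcasting time, PTS), in arrival order.
    """
    for bt_arrival, pts in iframes:
        if bt_arrival >= bt:
            return pts
    raise ValueError("no I-frame at or after the requested broadcasting time")


def time_difference(pts_first_frame, pts_iframe):
    """Equation (1): negative means the first frame of the segment is missed."""
    return pts_first_frame - pts_iframe


# Hypothetical recording with two I-frames per second.
iframes = [(99, 48900), (99, 49400), (100, 49900),
           (100, 50400), (101, 50900), (101, 51400)]
pts_b = 50200   # first frame of the segment, time-stamped with BT = 101
bt_start = 101

# Seeking to the first I-frame since the start time misses the frame.
td_naive = time_difference(pts_b, first_iframe_since(iframes, bt_start))
# Stepping back one broadcasting time unit (one second) keeps it.
td_proposed = time_difference(pts_b, first_iframe_since(iframes, bt_start - 1))
```

With these made-up values `td_naive` is negative and `td_proposed` is positive, which is the behaviour the proposal aims for.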




Fig. 2. An example of the media localization: (a) The indexing system indexes the
broadcast stream with the broadcasting time BT. (b) The generated metadata describes
the broadcasting time. (c) The client PVR locates the start position of the segment
in the recorded stream by the broadcasting time. (Legend: $BT_n$ is the broadcasting
time of frame A and $BT_{n+1}$ is the broadcasting time of frame B; markers denote
I-frames and the start frame of a segment.)
2.2 Real-Time Segmentation Using Spatiotemporal Visual Pattern

Several approaches [9-11] have recently been proposed for automatic video
indexing by analyzing video, audio and closed captions. However, with the current
state of the art in image understanding and speech recognition, it is still hard
to detect highlights accurately and generate meaningful metadata in real-time.

In order to index a broadcast program in real-time, an operator might have to watch
the current broadcast program carefully and manually determine the start and/or end
times of events before the program ends. An event is usually composed of a
shot or a set of consecutive shots. Many of these shots can be detected automatically by a
suitable algorithm, but with false alarms and missed shots caused by editing effects such as
zooming in/out, fading, dissolves, and wipes. To get the exact time information of the
events, the operator might have to verify the result of the automatic algorithm by playing
back suspicious segments repeatedly, which takes considerable time. Thus, in order to
overcome such problems and quickly index the live broadcast program, we need a
new tool for easily verifying shot boundaries.

A spatiotemporal visual pattern called Visual Rhythm [12], also known as a
spatiotemporal slice [13], is a two-dimensional abstraction of the entire
three-dimensional content of the video, and provides an efficient way of verifying
video segments.

The most distinctive feature of the visual rhythm is that different video effects,
including edits and others such as cuts, wipes, zooms and camera motions, manifest
themselves as different visual patterns on the visual rhythm, as shown in Fig. 3.
Thanks to this feature, an operator can find missed shot boundaries, for example the
wipe in shot $n$ on the right side of Fig. 3, which might not be detected by the automatic
scene change detection. The operator manually divides shot $n$ into two shots, shot $n_1$
and shot $n_2$, so as to determine the boundary between segment $m$ and segment $m+1$.

Therefore, including the visual rhythm in the user interface of the real-time indexing
application helps an operator identify segment boundaries easily and quickly. In
addition, the visual rhythm itself might be used as a primitive input for automatic shot
detection.
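
To make the idea concrete, the following sketch builds a visual rhythm by sampling one line of pixels per frame and stacking the lines over time. The diagonal sampling path, the grayscale list-of-lists frame format, and the simple column-difference cut test are assumptions chosen for illustration; they are not taken from the indexing application itself.

```python
def visual_rhythm(frames):
    """Build a visual rhythm: one column of sampled pixels per frame.

    Each frame is a 2D list of grayscale values (rows of pixels). Pixels
    are sampled along the main diagonal (an assumed sampling path).
    """
    columns = []
    for frame in frames:
        height, width = len(frame), len(frame[0])
        n = min(height, width)
        columns.append(
            [frame[i * height // n][i * width // n] for i in range(n)]
        )
    return columns


def cut_candidates(columns, threshold):
    """Flag frame indices where adjacent columns differ sharply (hard cuts)."""
    cuts = []
    for t in range(1, len(columns)):
        prev, cur = columns[t - 1], columns[t]
        mean_diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if mean_diff > threshold:
            cuts.append(t)
    return cuts


# Synthetic video: three dark 4x4 frames followed by three bright ones.
dark = [[10] * 4 for _ in range(4)]
bright = [[200] * 4 for _ in range(4)]
frames = [dark, dark, dark, bright, bright, bright]

vr = visual_rhythm(frames)
cuts = cut_candidates(vr, threshold=50)
```

A hard cut appears as an abrupt vertical edge between columns, whereas gradual effects such as wipes spread over several columns and are easier for an operator to spot visually than for this naive test to detect.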




Fig. 3. (a) VR extraction from the video V. (b) Editing effects presented in VR.



2.3 Metadata Delivery 

One way to describe segmentation metadata is to use international standards for
metadata specification such as MPEG-7 or TV-Anytime. The MPEG-7 or
TV-Anytime metadata can be multiplexed into the MPEG-2 transport stream that is broadcast
to clients through the broadcasting network. There are several possible ways of
delivering the standard metadata to clients through the broadcast stream: defining a new
MPEG-2 private section or descriptor, using the DSM-CC sections, or specifying a new
type of MPEG-2 PES.

These approaches have two inherent problems. First, segmentation metadata
generated according to the metadata standards is often large and thus occupies
a non-negligible amount of the data broadcasting bandwidth, which current DTV
service providers want to minimize. Second, the approaches will take a long time
to be realized because they require changing existing software and hardware
components, or adopting new ones, in the existing broadcasting environment.

Therefore, a new technique is needed to deliver, through the existing broadcasting
environment, segmentation metadata that is smaller than segmentation metadata
based on MPEG-7 and TV-Anytime.

In the proposed system, instead of defining a new field for the segmentation
metadata, we adopt the existing EPG (Electronic Program Guide) as the carrier of the
segmentation metadata because it can be used without any modification of
broadcast equipment. That is, we use the field for the detailed description (synopsis)
of a program in the EPG data structure. Since the detailed description of a program is
presented on the viewer's screen, we have designed a new compact metadata format that is
legible and informative even for viewers whose PVRs lack the metadata browsing
module, which is currently ported only to our test-bed client PVR. In Table 1, the syntax of the segmentation
metadata is given in BNF (Backus-Naur Form) grammar, together with an
example used in our test-bed. The size of the example metadata in Table 1 is
only 239 bytes, whereas the TV-Anytime format requires more than
5K bytes for the same segmentation information. Thanks to this small size, we can carry it in
the detailed description of a program in the EPG, which is in practice limited to
250 bytes in our test-bed.

Table 1. BNF grammar for our segmentation metadata format and an example.

BNF grammar for metadata format:

<segment_info> ::= <title> <segments>
<title> ::= <string> LF
<segments> ::= <segment> | <segment> <segments>
<segment> ::= <media_locator> SP [<segment_title>] LF
<media_locator> ::= <2digit> ':' <2digit> ':' <2digit>
<segment_title> ::= <hierarchical_sequence> SP <string>
<hierarchical_sequence> ::= '<' <sequence_number> '>'
<sequence_number> ::= DIGIT | DIGIT '.' <sequence_number>
<string> ::= CHAR | CHAR <string>
<2digit> ::= DIGIT DIGIT

Example:

Survival English SEP 4
06:20:41 <1> Introduction
06:22:49 <2> Today's Dialog
06:23:19 <3> Dialog Part I
06:27:04 <4> Dialog Part II
06:31:00 <5> Dialog Part III
06:33:08 <6> More Expressions
06:34:25 <7> Review Dialog
06:35:48 <8> Help Me
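
A client-side parser for this compact format could look like the sketch below. It follows the example in Table 1 (a title line, then one `HH:MM:SS <n> title` line per segment); the colon-separated time and the dotted hierarchical sequence number are our reading of the grammar, and the function and field names are hypothetical rather than those of the test-bed software.

```python
import re

# media locator, a space, then an optional "<sequence> title" part.
SEGMENT_RE = re.compile(r"^(\d{2}):(\d{2}):(\d{2}) (?:<([\d.]+)> (.*))?$")


def parse_segment_info(text):
    """Parse EPG synopsis text into (program title, list of segments)."""
    lines = text.split("\n")
    title = lines[0]
    segments = []
    for line in lines[1:]:
        if not line:
            continue  # tolerate a trailing line feed
        match = SEGMENT_RE.match(line)
        if match is None:
            raise ValueError("malformed segment line: %r" % line)
        hh, mm, ss, seq, seg_title = match.groups()
        segments.append({
            "time": (int(hh), int(mm), int(ss)),  # broadcasting time
            "sequence": seq,                      # e.g. "5", or None
            "title": seg_title,
        })
    return title, segments


example = (
    "Survival English SEP 4\n"
    "06:20:41 <1> Introduction\n"
    "06:22:49 <2> Today's Dialog\n"
    "06:23:19 <3> Dialog Part I\n"
    "06:27:04 <4> Dialog Part II\n"
    "06:31:00 <5> Dialog Part III\n"
    "06:33:08 <6> More Expressions\n"
    "06:34:25 <7> Review Dialog\n"
    "06:35:48 <8> Help Me\n"
)
program_title, segments = parse_segment_info(example)
```

Because the text stays human-readable, a PVR without such a parser can still show it as an ordinary program synopsis.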



3 Implementation and Experimental Results 



Real experiments with ATSC terrestrial HDTV programs were performed by porting
our software onto a commercially available PVR. The scenario we have implemented
is as follows. First, we index a broadcast program in real-time and immediately send
the resulting metadata to a broadcast station.

Second, the delivered metadata of the program is inserted into the synopsis field
of the program in the EPG, which is transmitted to client PVRs through the
broadcasting network.

Finally, the client PVR detects the EPG update and retrieves the metadata of the 
program in the delivered EPG. The client PVR then locates a segment of interest from 
the recorded stream according to the broadcasting time described in the delivered 
metadata. Thus, the client PVR user can browse the recorded program through 
functionalities such as segment play/replay and random access to the segment of 
interest. 



Fig. 4. The timestamp and the real-time indexer using spatiotemporal visual pattern.
(Timestamp fields shown in the figure: Header, Channel Number, Broadcasting Time.)



The RTIS is composed of a real-time indexer (a personal computer) equipped with
an encoder for low bit-rate encoding, and a timestamp mixer, which is an STB that includes
a timestamp generator. We implemented the timestamp mixer by programming the
timestamp generator module and then porting it onto the commercially available
PVR. Fig. 4 shows an example of the timestamp, represented with the structured
color-code [8] and superimposed onto the frame, and a screen shot of the indexing application
that indexes a broadcast program in real-time using the visual timeline, called the visual
rhythm, shown at the top of the application.

For the client PVR in our test-bed, we used a commercially available PVR,
an HDTV STB with a 40 GB HDD, on which we developed our
applications. One application is responsible for retrieving the metadata contained in
the EPG: checking for EPG updates, extracting the metadata of a recorded program
from the EPG, and storing the metadata. The other application handles
browsing the recorded program with the retrieved metadata: locating a
video segment of interest, extracting key frames (thumbnail images) for
the browsing window, and managing the graphical user interface.

For the experiments, we indexed an educational program that was broadcast at 6:20
AM in Korea. We indexed the program while it was being broadcast using the
real-time indexer shown on the right side of Fig. 4. The indexing process finished
one minute after the end of the program. We manually verified the segmentation
results, and then generated metadata such as that shown in Table 1. Immediately after
generating the segmentation metadata, we sent it by email to the
operator responsible for updating the EPG at the broadcaster. The operator then
updated the detailed description (synopsis) of the program with the received
metadata using the EPG builder. This took some minutes because the operator had to check his email,
copy the metadata script, and paste it manually into the input field of the detailed
description of the program. Finally, after the EPG update was applied at the
broadcaster, the metadata was broadcast through the broadcast network
and received by the client PVR, which then extracted the metadata. In the
experiment, it took about 5 minutes from the end of the program to the time
the metadata was received on the client PVR. This delay is mainly due to the
manual work of sending an email and updating the EPG. With an interactive
channel between the EPG builder and our real-time indexer, and with the EPG update
controlled by software, most of this delay could be removed. PVR
users could then browse a recorded program with the corresponding metadata just after the
recording is finished.

Fig. 6. (a) The time difference by (1). (b) The time difference of the proposed method.
(Horizontal axis: frame number.)

Fig. 5 shows the resulting TV screen displayed in PVR when we browse the 
recorded program with the delivered metadata. The key frame shown in the left of the 
screen is the image extracted from the recorded stream in PVR by using the 
broadcasting time described in the delivered metadata. 

In our experiment, we observed that the first part of a segment was often missed.
To see how large the time difference was, we measured the time difference (1)
on the broadcast stream, as shown in Fig. 6(a). Negative values of the time difference
in Fig. 6(a) indicate that playback started later than the desired start position of
the segment, by the absolute value of the time difference. On the other hand,
the time difference of the proposed scheme has no negative values, as shown in
Fig. 6(b), since we subtracted one second (based on the ATSC STT) from the broadcasting
time described in the metadata to avoid missing frames when we implemented the
browsing module on the PVR.

4 Conclusion

We have introduced a real-time metadata service scheme and implemented a test-bed
comprising an indexing host, a broadcaster, and a client PVR. For the service scheme, we
have proposed a novel method of indexing a broadcast program in real-time,
which uses the broadcasting time carried in the broadcast stream itself.
The experiments showed that the method can be applied to current
digital broadcast environments without changing any software or hardware
components. Moreover, they served as a useful demonstration of a real-time
metadata service for live digital broadcast programs.

Abstract. People as news subjects carry rich semantics in broadcast news video,
and therefore finding a named person in the video is a major challenge for video
retrieval. This task can be achieved by exploiting the multi-modal information
in videos, including transcript, video structure, and visual features. We propose
a comprehensive approach for finding specific persons in broadcast news
videos by exploring various clues such as names occurring in the transcript, face
information, anchor scenes, and, most importantly, the timing pattern between
names and people. Experiments on the TRECVID 2003 dataset show that our
approach achieves high performance.



1 Introduction 

The dramatic increase of digital videos demands more efficient and accurate access to
video content. Content-based analysis and retrieval has been extensively used for
video segmentation [2], video retrieval [3], and image retrieval [1]. As discussed in
[4], finding a specific person in videos is essential to understanding and retrieving videos.
Although solving this problem might be difficult for general videos, in this paper we
target very specific content, namely broadcast news video. Since news videos are
strongly related to human subjects, finding "person X" is an important and frequent
challenge. Taking advantage of the multimodal content in videos, we propose a
people-finding approach which exploits name occurrence in the transcript, video
structure, and visual information such as faces and news anchor scenes. Specifically,
this approach uses a timing model to overcome the temporal offset between names
and persons, which would otherwise compromise performance. Our approach was
developed and evaluated using the dataset from the TREC 2003 Video Track (TRECVID)
[5], which is divided into a training set (FSD) and a testing set (FST), each consisting
of over 100 hours of ABC, CNN, and C-SPAN news video.



2 Transcript Search with Timing-Based Score Propagation 

An essential clue for finding a person in broadcast news video is the mention of
his/her name in the transcript, acquired either from a speech recognizer or from closed
captions. This clue indicates that the person is likely to appear visually. We do not
address the rare cases where a person appears without his/her name being mentioned.

In this section, we discuss using the transcript to find and rank video shots that contain
specific persons. Here a video shot is defined as an unbroken sequence of frames
taken by one camera, and it serves as the basic structural unit in our video retrieval.



2.1 Basic Transcript-Based Search 



Since the transcript is temporally aligned with the video, each shot is associated with
the portion of the transcript that falls within its boundaries. Therefore, an intuitive way to
find a specific person in video is to use text-based retrieval techniques to find the
shots which contain the name. Specifically, we employ the TFIDF retrieval method
[6], which gives the similarity between a shot S and a person named X as:



$R(X, S) = \sum_{t_i \in X} tf_i \cdot \log \frac{N}{n_i}$ (1)

where $tf_i$ is the frequency of term $t_i$ (as a part of the name X) in the transcript of shot
S, $N$ is the total number of shots, and $n_i$ is the number of shots whose transcript has $t_i$.
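
A direct transcription of this score into code might read as follows. It is a sketch under simplifying assumptions: transcripts are already tokenized into lowercase words, a name is a list of terms, and terms that occur in no shot contribute nothing. The function name and the toy transcripts are ours.

```python
import math


def tfidf_score(name_terms, shot_tokens, all_shots_tokens):
    """Similarity R(X, S) between a name X and a shot S, as in Equation (1)."""
    n_shots = len(all_shots_tokens)                 # N
    score = 0.0
    for term in name_terms:                         # each t_i in X
        tf = shot_tokens.count(term)                # tf_i
        n_i = sum(1 for toks in all_shots_tokens if term in toks)
        if tf and n_i:
            score += tf * math.log(n_shots / n_i)
    return score


# Toy transcripts, one token list per shot (invented for illustration).
shots = [
    ["bill", "gates", "said"],
    ["gates", "spoke"],
    ["weather", "report"],
    ["sports"],
]
name = ["bill", "gates"]
scores = [tfidf_score(name, s, shots) for s in shots]
```

Shots whose transcript never mentions the name score zero, which is exactly the limitation the timing model in the next subsection addresses.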



2.2 Modeling Timing Between Names and Persons 

The method above is subject to a severe problem: it is not necessarily the case that a 
person appears in the video concurrently with the name mentioned in the transcript. 
Based on the statistics we have collected, in more than half the cases, a person does 
not show up in the shot where the name is mentioned, but before or after that shot. 
Undoubtedly, this mismatch seriously compromises the performance of text-based 
shot retrieval, which explores only the shots containing the person's name. 

The timing between visual appearances (i.e., face) and occurrences of a name is 
related to the "video grammar" of broadcast news. In a typical news story, an 
anchorperson briefs the news at the beginning, followed by several shots showing the 
news event and sometimes interviews and reporters. The name of a human subject in 
the news is normally first mentioned by the anchorperson, while his/her face is not 
always shown at that time. In the following shots, this person may appear several 
times in the video, roughly interleaved with occurrences of the name in the transcript. 
However, there are also cases where a person not mentioned by the anchorperson later 
appears in the shots, with or without his name mentioned in close proximity. 

Generally, no simple pattern is able to capture all the possibilities of such timing, but it
is still true that a person is more likely to appear in the (temporal) proximity of where his
name is mentioned. Loosely speaking, the closer a shot is to the name occurrence, the
more likely it is to contain the person's visual appearance. As an example, we collected all
the visual appearances of "Bill Gates" in FSD, and plotted in Fig. 1 the frequency of these
appearances at each quantized distance from the closest occurrence of his name. The
distance is measured in terms of time or shot offset (the number of shots in between). The
"0" point on the distance axis is where the name is mentioned, and a positive distance
means that the person appears visually after the name is mentioned.

Fig. 1. The frequency of Bill Gates' visual appearances associated with name occurrences, and
the Gaussian curves capturing the frequency distribution. (Surviving panel label:
(b) Distance in shot offset.)

Based on Fig. 1, it is intuitive to model the frequency of a person's visual
appearance w.r.t. his name occurrence using a Gaussian model. For a specific person,
we estimate a Gaussian distribution from the distances from each of his visual
appearances in FSD to the closest name occurrence, both of which are manually
labeled, using maximum likelihood estimation. Again, the distance is measured in
terms of time or shot offset. In Fig. 1, we superimpose the curves of the estimated
Gaussian distributions for "Bill Gates", which nicely capture the shape of the bins
showing the frequencies.
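
For a Gaussian, maximum likelihood estimation reduces to the sample mean and the (biased, divide-by-n) sample standard deviation of the observed distances. A minimal sketch, with invented distances standing in for the manually labeled data:

```python
import math


def fit_gaussian(distances):
    """Maximum likelihood estimate (mean, std) of a Gaussian.

    distances: signed offsets from each visual appearance to the closest
    name occurrence (seconds or shots); positive means the person
    appears after the name.
    """
    n = len(distances)
    mean = sum(distances) / n
    variance = sum((d - mean) ** 2 for d in distances) / n  # MLE uses 1/n
    return mean, math.sqrt(variance)


# Hypothetical offsets in seconds for one person.
offsets = [1.0, 2.0, 3.0]
mu, sigma = fit_gaussian(offsets)
```

With only a handful of appearances the estimates are unstable, which matters for the infrequently appearing people discussed below.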

In total, 20 persons are selected for study, varying from frequently appearing ones
like "Michael Jordan" to rare ones like "Alan Greenspan". Table 1 shows the number
of visual appearances of each person in FSD and FST respectively. The mean and
standard deviation of the Gaussian distribution of each person estimated on FSD are
plotted in Fig. 2(a) for time-based distance and in Fig. 2(b) for shot-based distance.
People are ordered from left to right in descending frequency of their visual
appearance in FSD. A global distribution computed from the pooled training data
of all the people is shown alongside.



Table 1. The 20 people studied and the number of their visual appearances in FSD and FST.

Name       FSD/FST   Name       FSD/FST
Lewinsky   53/44     Malone     11/19
Jordan     47/75     Netanyahu  7/42
Yeltsin    40/10     Kendall    6/3
Starr      37/35     Hillary    6/12
Albright   30/40     Arafat     3/33
Ginsburg   28/22     Kohl       3/6
Pope       29/45     Greenspan  2/6
McCartney  26/10     Suharto    2/20
Gates      22/19     Jiang      2/19
Diana      12/7      Laden      0/26



As shown in Fig. 2(a), for the first 9 people on the left, each of whom appears 20+ times
in FSD, the estimated distributions have similar mean values (1-3 sec.) and moderate
standard deviations (3-6 sec.). This suggests that the Gaussian assumption is
reasonably good for these people, and that their distributions are similar to each other.
Therefore, on average a person appears about 2 seconds after his name is mentioned
in the "grammar" of news video. For the people with fewer than 20 appearances in FSD,
however, the estimated distributions differ significantly: the mean varies from -2 to 14
seconds, and the standard deviation can be as large as 12 seconds. But it is not fair to
say that each infrequent name has a unique distribution, since our observation is
biased by the insufficient training data in FSD used to estimate their distributions. We
will explore this question further in our experiments. The same trend is observed in
the shot-based distributions in Fig. 2(b).




Fig. 2. The mean and standard deviation of the Gaussian distributions for each person:
(a) time-based distance; (b) shot-based distance.



2.3 Search Methods with Score Propagation 

Given the timing information, the basic transcript-based search can be improved by
propagating the similarity scores from the shots containing the intended person's name
to the neighboring shots in a window. The propagation is carried out as:

    R_p(X, S) = \sum_{|S - S_i| < w} f(S, S_i) \, R(X, S_i)    (2)

where w is the size of the window measured either by time or by shot offset, and
f(S, S_i) is a weighting function with output within (0, 1), which decides the score being
propagated to neighboring shots. The summation traverses all the shots S_i that are in
the neighborhood of S and have the intended name in the transcript.
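
The propagation in Eq. (2) can be sketched as follows. This is only an illustrative sketch, assuming that shots are indexed by position, that distance is measured in shot offsets, and that a shot whose transcript lacks the name has a base score of 0; the names `propagate_scores` and `flat_weight` are hypothetical and do not come from the paper.

```python
from typing import Callable, List


def propagate_scores(
    base_scores: List[float],
    weight: Callable[[int, int], float],
    window: int,
) -> List[float]:
    """Eq. (2) with shot-offset distance: R_p(S) = sum_i f(S, S_i) * R(S_i).

    base_scores[i] is the transcript score R(X, S_i); it is 0 for shots
    whose transcript does not contain the name (assumption of this sketch).
    """
    n = len(base_scores)
    propagated = [0.0] * n
    for i, score in enumerate(base_scores):
        if score <= 0.0:
            continue  # shot i does not mention the name, nothing to propagate
        # all shots s with |s - i| < window receive a share of shot i's score
        for s in range(max(0, i - window + 1), min(n, i + window)):
            propagated[s] += weight(s, i) * score
    return propagated


def flat_weight(s: int, i: int) -> float:
    """Hypothetical flat window: every shot in the window gets the same weight."""
    return 0.5


# The name is mentioned only in the third of five shots.
propagated = propagate_scores([0.0, 0.0, 1.0, 0.0, 0.0], flat_weight, window=2)
```

With a window of 2, only the mentioning shot and its immediate neighbors receive a score; swapping `flat_weight` for another function changes the shape of the window without touching the summation.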

The weighting function f(S, S_i) can take many forms, depending on the design
decisions made along the following dimensions:

• Flat window or weighted (Gaussian) window. In a flat window, f(S, S_i) is a
constant and all the shots in the window are propagated with the same score. In a
weighted window, however, the score propagated to each shot is determined by its
probability of containing the person's visual appearance, which is calculated from
the density function of a Gaussian distribution. In this case, f(S, S_i) is

    f(S, S_i) = \int_{start}^{end} N(\mu, \sigma^2)    (3)

where start and end are the starting and ending positions of S in relation to S_i
(which has the intended name), and N(\mu, \sigma^2) is the density function of the
Gaussian distribution.

• Time-based or shot-based distance measure: This decides whether to use a time-based
Gaussian model N^t_X(\mu, \sigma^2) or a shot-based one N^s_X(\mu, \sigma^2). This makes a
difference since the shot length differs a lot, and it is unclear which measure is
more desirable for revealing the relationship between a person's visual
appearance and the name occurrence.

• Local, global or combined Gaussian distribution: To search for a person, we can
use the local Gaussian distribution trained particularly for this person, N_X(\mu, \sigma^2),
the global distribution trained on all the people, N_G(\mu, \sigma^2), or a combination of
them, N_C(\mu, \sigma^2). Intuitively, if each person has a unique distribution and there is
enough training data, the local (people-specific) model is more desirable; otherwise
the global one is better. The combined model uses a distribution integrated from
both the local distribution and the global one. Inspired by the smoothing techniques
used to overcome the sparse training data problem in information retrieval [8], this
model "smoothes" a person's local distribution estimated from insufficient data
with the global distribution. Specifically, the probability density function of the
combined distribution is a linear combination of that of the local and the global
distribution, where the weight is determined by the amount of training data
associated with the person. It is formulated as:

    N_C = \alpha N_X + (1 - \alpha) N_G   and   \alpha = sigmoid(T_X / \beta - \gamma)    (4)

where \alpha is the weight computed from the number of training data T_X for person
X, and \beta and \gamma are constants, which are set to 10 and 1 as determined by our
informal experiments. According to the property of the sigmoid function, \alpha
approaches 1 when T_X increases, and vice versa (e.g., \alpha = 0.5 when T_X = 10, and
\alpha = 0.88 when T_X = 30). Therefore, the more training data we have observed, the
more the combined distribution is determined by the local distribution.
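
To make the weighted window of Eq. (3) and the smoothing weight of Eq. (4) concrete, here is a small sketch. It assumes that the integral in Eq. (3) is evaluated with the closed-form Gaussian cumulative distribution function; the function names and the example means and standard deviations are hypothetical, and only β = 10 and γ = 1 come from the text.

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gaussian_window_weight(start: float, end: float, mu: float, sigma: float) -> float:
    """Eq. (3): probability mass of N(mu, sigma^2) between start and end."""
    return normal_cdf(end, mu, sigma) - normal_cdf(start, mu, sigma)


def smoothing_weight(t_x: float, beta: float = 10.0, gamma: float = 1.0) -> float:
    """Eq. (4): alpha = sigmoid(T_X / beta - gamma)."""
    return 1.0 / (1.0 + math.exp(-(t_x / beta - gamma)))


def combined_window_weight(start, end, local, global_, t_x):
    """Weight under N_C = alpha * N_X + (1 - alpha) * N_G.

    Integration is linear, so integrating the mixed density equals
    mixing the two integrals. local and global_ are (mu, sigma) pairs.
    """
    alpha = smoothing_weight(t_x)
    return (alpha * gaussian_window_weight(start, end, *local)
            + (1.0 - alpha) * gaussian_window_weight(start, end, *global_))


# Made-up parameters: a shot spanning 0-5 s after the name is mentioned,
# for a person with only 3 training examples.
w_demo = combined_window_weight(0.0, 5.0, local=(14.0, 12.0), global_=(2.0, 4.0), t_x=3)
```

For a person with few training examples, `smoothing_weight` is well below 0.5, so the result is dominated by the global distribution, which is the behavior the paragraph above describes.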



3 Face Searching and Anchor Filtering 

Visual information provides valuable clues for finding a person in news video. Unlike
text information, which only roughly estimates where a person is, visual information can
tell the exact position and time of the person's appearance. Face recognition
technology can match a person's face visually and predict its identity, though its
performance is significantly affected by pose and illumination variations. Another
important visual clue comes from anchor detection, since people who are news subjects
seldom appear during an anchor shot.

We apply the well-known Eigenface algorithm [9] for face recognition. Faces are
collected using a face detection system [10], converted to gray levels and normalized
to a standard size. Principal component analysis (PCA) is performed to construct
Eigenfaces, which encode the most distinguishing parts of faces while ignoring similar
parts. The Eigenface representation has been shown to be a fairly robust approach to
face recognition. However, it also has several drawbacks, the most serious of which is
sensitivity to pose variations, as non-frontal faces usually have much poorer recognition
results than frontal ones. Lighting conditions present another serious problem. In broadcast
news, due to the large variations in news footage, both the pose and the lighting conditions
of faces vary widely, resulting in unreliable face recognition.

To avoid the face recognition difficulties, we first use the trustworthy text
information to find some shots as initial results, and apply face recognition to them to
obtain additional clues for refining the initial results. In this way, the number of faces
to be recognized is greatly reduced and the accuracy can be improved. To address the
wide variance in pose and lighting conditions, we find external images that contain
the target face under varied conditions and use them as examples to recognize relevant
faces. The (internal) faces to be recognized are extracted from the i-frame of the shots
to be examined. Let the external Eigenfaces be denoted as {F_1, F_2, F_3, ..., F_n} and
the internal Eigenfaces be denoted as {f_1, f_2, f_3, ..., f_m}. By matching every internal
face with a specific external face F_i based on Eigenface, we obtain a ranking of all
internal faces ordered by descending similarity to F_i. The final rank of an internal face
is combined from its ranks with all the external faces, given as:




where R_i(f_j) denotes the similarity rank of internal face f_j with external face F_i and
R(f_j) denotes the final rank of f_j. Since the external faces provide variation in pose
and lighting condition, the final rank gives us a more robust prediction. Since a shot
may have more than one i-frame, we average the rank of the face over every i-frame of
the shot to get the score indicating how likely the shot is to contain the target face. More
details of our face recognition method can be found in [11].

The inclusion of anchor detection assumes that anchors seldom co-occur with a 
news subject person. We have built an anchor detector [3] based on multimodal 
classification that combines three information sources: the color histogram from 
image data, speaker ID from audio data, and face information from face detection. 
Face information contains the position, size and detection confidence of faces. 
Fisher’s Linear Discriminant (FLD) is applied to select distinguishing features for 
each source of information. Selected features are synthesized into a new feature 
vector of each shot, and the classification is performed on these feature vectors. 

The final prediction of the appearance of the target person is made by linearly
combining the results of text-based search, anchor detection and face recognition:

    P(S) = \alpha \, T_{prior}(S) + \beta \, Anchor(S) + \gamma \, F(S)    (6)

where \alpha, \beta and \gamma are weights for the three predictions, which are trained on a
held-out set from FST (as FSD has been used to train the distributions).
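
The linear combination in Eq. (6) can be illustrated with a short sketch. The default weights below are the values reported in Sect. 4 (1.0 for transcript, -0.812 for anchor filtering, 0.087 for face recognition); the function name, the argument names and the example scores are hypothetical, and the sketch assumes all three inputs are already scaled to [0, 1].

```python
from typing import Tuple


def combined_prediction(
    text_score: float,
    anchor_score: float,
    face_score: float,
    weights: Tuple[float, float, float] = (1.0, -0.812, 0.087),
) -> float:
    """Eq. (6): weighted sum of the text, anchor and face predictions."""
    alpha, beta, gamma = weights
    return alpha * text_score + beta * anchor_score + gamma * face_score


# Two made-up shots with the same text score: one detected as an anchor shot,
# one a non-anchor shot with a confident face match.
anchor_shot = combined_prediction(text_score=0.6, anchor_score=1.0, face_score=0.0)
subject_shot = combined_prediction(text_score=0.6, anchor_score=0.0, face_score=1.0)
```

Because the anchor weight is negative, a shot flagged as an anchor shot is pushed down the ranking even when its transcript score is high, while the small face weight only nudges the ranking.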



4 Experiment Results 

Experiments in finding the 20 selected persons in the TRECVID 2003 collection are
conducted to determine the best people-finding method among those proposed in
Sect. 2.3. First, we compare the performance of the basic transcript-based search
method without score propagation (denoted as Baseline), the method with flat-window
propagation (Flat_Win), the one with shot-based Gaussian propagation using
the local distribution estimated from FSD (Shot_Gauss_Local), and its time-based
counterpart (Time_Gauss_Local). For each person, we use each method to find the
shots in FST that contain his/her visual appearance and compute the mean average
precision (MAP) [7] of the results. Note that the propagation window sizes in each
method have been fine-tuned based on the FSD data.
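
For readers unfamiliar with the measure, the following sketch computes non-interpolated average precision for one ranked list and averages it over several queries. It uses a common textbook definition and is an assumption on our part; the exact evaluation procedure of [7] (for example, any cut-off on the ranked list) may differ. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def average_precision(ranked_relevance: Sequence[bool], num_relevant: int) -> float:
    """Mean of the precision values at each rank where a relevant shot occurs.

    ranked_relevance[k] is True if the shot at rank k+1 shows the person;
    num_relevant is the total number of relevant shots in the collection.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / num_relevant if num_relevant else 0.0


def mean_average_precision(runs: List[Tuple[Sequence[bool], int]]) -> float:
    """Average of the per-query average precision values."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)


ap_demo = average_precision([True, False, True], num_relevant=2)
map_demo = mean_average_precision([([True, False, True], 2), ([False, True], 1)])
```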




Fig. 3. Performance comparison of three propagation methods with baseline method 

As shown in Fig. 3, in all the 20 queries at least one propagation approach 
outperforms the baseline, and for 15 queries among them, all the three propagation 
approaches outperform the baseline. This suggests that score propagation based on 
timing information can greatly help the task of people-finding. Moreover, in 17 out of 
the 20 queries, the time-based Gaussian approach is the best performer, whose 
average MAP (0.40) is much higher than that of the flat-window approach (0.29) and 
shot-based Gaussian approach (0.28). Thus, time-based Gaussian is a better 
propagation strategy than the other two, implying that time is a better distance 
measure than shot offset w.r.t. revealing the timing between names and people. 

Fig. 4 shows the average MAP (over 20 queries) of the time-based Gaussian 
method using local, global, and combined distribution respectively, in comparison to 
that of baseline and flat-window approach. As shown, the approach with combined 
distribution outperforms the global one by 2%, which beats the local one by another 
2%, and all are about twice the performance of the baseline approach. 

Fig. 4. Performance comparison of local, global, combined distribution with visual information
(legend: Baseline, Flat_Win, Time_Gauss_Local, Time_Gauss_Global, Time_Channel_Global;
bar groups: frequent names, infrequent names)

The three types of distribution cause a more interesting discrepancy between the
performance of finding frequently occurring people and that of finding infrequent
ones. Here frequent people are those who appear visually 20+ times in both FSD and
FST (cf. Table 1), while infrequent ones are those appearing fewer than 20 times in both
FSD and FST. By this standard, there are 7 frequent and 8 infrequent people among the 20
people, while 5 people cannot be clearly classified due to their unbalanced
appearances in FST and FSD. As we can see, for frequent names the choice of
distribution does not have any significant influence on the performance, while for
infrequent names the difference is substantial. Specifically, for all the 7 frequent
people, the MAP of the global distribution never differs from that of the local distribution
by over 10%, while for 5 out of the 8 infrequent ones, the global distribution enhances the
MAP by over 20%. This echoes our observation in Sect. 2.1 that the distributions of
frequent names are similar to each other and thus to the global one, which is dominated
by the dense training data of frequent people. Therefore, the performance of finding
such people is almost unaffected by the choice of distribution. For infrequent people,
since their local distribution is poorly estimated using their insufficient training data,
the performance can benefit from using the more stable global distribution. It is
interesting to see that the combined distribution is better than the global one, which
implies that each name has a unique "true" distribution that lies between the global
and the local one. However, this conclusion can be challenged due to the insufficient
number of queries (8 infrequent names) and the small improvement (about 4%).

Since our data consist of both ABC and CNN news, it is interesting to know whether these
two channels have different styles that lead to different distributions. Thus, we train
two channel-specific global distributions on FSD and test them on FST. As shown in
Fig. 4, this approach (Time_Channel_Global) improves MAP over the uniform global
distribution by only 1%, suggesting that ABC and CNN have similar editing styles.

Finally, we combine transcript search with time-based smoothed distribution and
vision information. The combination weights we trained from the held-out set are 1.0
for transcript information, -0.812 for anchor filtering and 0.087 for face recognition.
These weights reflect the fact that face recognition is very unreliable, while
anchor detection has the ability to remove false positives. As shown in Fig. 4,
combining transcript with visual information gives another 3% improvement, which is
mainly derived from anchor detection. Among the 20 people, the visual information
enhances the MAP on 4 people substantially (over 20%), and we find that they all
appear with frontal faces in the video. 10 people have a minor improvement (1%-20%)
in their MAP with visual information, while the remaining 6 people do not improve at all.



5 Conclusion 

In this paper, we address the task of finding a person using clues including transcript,
video structure, and vision information. The Gaussian distribution has been shown
experimentally to be an effective model for describing the timing pattern between a person's
visual appearances and the occurrences of his/her name. Specifically, a "smoothed"
Gaussian distribution estimated using both the local and global training data produces
the best performance, especially for infrequently appearing people. Finally, combining
visual information such as face recognition and anchor detection with transcript
information brings additional benefit to the person-finding task.



Abstract. We address the problem of classifying scenes from feature films into
semantic categories and propose a robust framework for this problem. We propose
that Finite State Machines (FSMs) are suitable for detecting and classifying
scenes and demonstrate their usage for three types of movie scenes: conversation,
suspense and action. Our framework utilizes the structural information of the
scenes together with the low and mid-level features. Low-level features of video,
including motion and audio energy, and a mid-level feature, face detection, are
used in our approach. The transitions of the FSMs are determined by the features
of each shot in the scene. Our FSMs have been tested on over 60 clips and
convincing results have been achieved.



1 Introduction 

Recent years have seen a growing interest in the annotation and retrieval of video data. 
The increasing number of subscribers to digital cable now demands efficient tools so 
that viewers can browse and search sections of interest of video. Among many genres 
of video production, feature films are a vital field for the application of such tools. It is 
a sizeable element of the entertainment industry, easily available, widely watched and 
therefore, is becoming a focus of researchers in many aspects. For example, applications 
for content-based video annotation and retrieval have been developed at all levels of the 
video structure; shot level, scene level, and movie level. A shot is a sequence of images 
that preserve consistent background settings. It is the basic element of a movie. A scene, 
which consists of a set of continuous shots, constitutes a portion of the story line. On the 
highest level, a movie is composed of a series of related scenes defining a theme. For a 
user, who may be looking for a particular scene of a feature film, a shot level analysis is 
insufficient since a shot level analysis fails to capture the semantics of the video content. 
For example, how does one answer a query for a suspense scene in a feature film based 
on a single shot content? Any semantic category like suspense or tragedy, cannot be 
defined over a single shot. These concepts are induced in viewers over time. Indeed, 
a meaningful result can only be achieved by exploiting the interconnections of shot 
content. 



P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 279-288, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

In this paper, we present a novel framework for classifying scenes, focusing on
feature films, into three semantic categories: conversation, suspense and action. This
method analyzes the structural information of the scenes based on the low-level and
mid-level shot features, which are robust and easily computable. The low-level features
used in our framework include shot motion and audio energy, and the mid-level feature is
face identity. To bridge the gap between the low and mid-level features and a high-level
semantic category, Finite State Machines are studied and developed. The transitions are
determined based on the statistics of these features for each shot. This paper is organized
as follows: related work is discussed in Section 2; Section 3 describes the classification
framework, including the features and the Finite State Machines for detecting conversation,
suspense and action scenes; Section 4 shows the experimental results; and Section
5 concludes our work.

2 Related Work 

In the area of higher level scene understanding, Adams et al. [1] proposed the detection
of “tempo” in movies. The camera motion magnitude and the shot length were the two
features used to compute a continuous function. Our framework, on the other hand,
analyzes the structure of the movie scene and classifies scenes into more specific categories.
Yoshitaka et al. [3] also used shot length and visual dynamics to analyze scene type.
In their approach, the color statistics of the frames in the shot were used to calculate
the visual dynamics and the similarities between the repeating shots were exploited.
Experiments on only one kind of scene were demonstrated and it was not clear how the
approach could be extended to other scene categories.

Lienhart et al. [4] used face detection in the scenes to link similar shots. A “face-based
class” with a group of related frames showing the same actor was constructed by
the similarity of the spatial positions and sizes of the detected faces. These “face-based
classes” were linked across shots in the video to form the “face-based sets” by using
Eigenfaces. The pattern of a dialog scene was flagged if several conditions were satisfied.
In their experiment, face recognition suffered from low accuracy and the system typically
split the same actor into different sets, causing overdetection. Li et al. [5] exploited the
global structural information of a scene and built “shot sinks” to classify a scene into one
of three scenarios: two-speaker dialog, multi-speaker dialog, and others. The overall
structure was computed based on the low-level visual features, such as color, of the shots
in the scene. In their approach face information, which is an important cue for speaker
detection, was not used. We combine both structure and face detection in a Finite State
Machine framework to provide a more general solution for the scene classification task.



3 Proposed Approach 

In this section, we first discuss the low-level and mid-level features used in our approach.
The activity intensity, which is a function of low-level features and includes local and
global motion and audio signal, is the input to the Finite State Machines. Human faces are
detected in shots, clustered, and also used as input. We construct FSMs for three different
semantic categories of scenes: conversation, suspense and action.

3.1 Computing Activity Intensity (Γ) Using Motion and Audio

Motion in the videos has been used by several researchers in detecting and identifying
scenes in feature films. Some examples are [1,2]. In feature films, the camera motion
is generally translation and zoom, whereas camera roll and tilt are rare. An affine motion
model is suitable for capturing translation, scale and rotation about the optical axis
of the camera. Therefore, we model the image-to-image global transformation using an
affine motion model. We exploit the motion vector information embedded in the MPEG
compressed video. The approximate motion model is computed based on the 16x16 pixel
macro-blocks. For each macro-block [x y]^T, its motion [u v]^T is computed as

    \begin{bmatrix} u \\ v \end{bmatrix} =
    \begin{bmatrix} a_1 & a_2 \\ a_3 & a_4 \end{bmatrix}
    \begin{bmatrix} x \\ y \end{bmatrix} +
    \begin{bmatrix} b_1 \\ b_2 \end{bmatrix},    (1)

where the vector [b_1 b_2]^T captures the global translation. The magnitude m of the
translation vector represents the intensity of the motion, and its absolute difference d
across adjacent frames gives the smoothness of the camera motion. Thus, an average global
motion quantity, A, over the entire shot captures both the intensity and the smoothness of
the global motion, that is:

    A = (m + \kappa_m) \times (d + \kappa_d),    (2)

where \kappa_m and \kappa_d are small positive constants to avoid multiplication by zero. The
local motion intensity also varies with the type of scene. For example, in a fighting scene,
high action intensity is commonly observed. We compute the local motion intensity by
computing the mean difference, \mu, of the reprojected motion vectors and the original
motion vectors for the entire shot.



Fig. 1. The audio signal for (a) conversation, and (b) action.



Sound also plays an important role in distinguishing scenes from each other. In
conversational scenes, characters speak smoothly and calmly. In action scenes, which
often include explosions, collisions, or vehicle chases, the audio energy is very high.
Figure 1 shows the plot of audio signals for (a) conversational and (b) action scenes.
Note that the high energy in the audio of the action scene is distinct from that of the
conversational scene. Therefore, the computation of activity intensity also incorporates
the mean audio energy \theta. The overall activity intensity is the combination of the three
quantities A, \mu and \theta as follows:

    \Gamma = A \times (\mu + 1) \times (\theta + 1)    (3)

Fig. 2. The activity A of three types of movie scenes (a) conversation, (b) suspense, and (c) action.
The horizontal axis represents the shot number in the scene.

Figure 2 shows the histogram of the activity intensity values for three types of shots:
(a) conversation, (b) action and (c) suspense.
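
The per-shot activity intensity of Sect. 3.1 can be sketched as a direct transcription of Eqs. (2) and (3). The inputs (translation magnitude m, smoothness d, local motion \mu, audio energy \theta) are assumed to be precomputed per-shot averages; the function names and the default values of the two small constants are hypothetical, since the text does not state them.

```python
def global_motion(m: float, d: float, k_m: float = 0.01, k_d: float = 0.01) -> float:
    """Eq. (2): A = (m + k_m) * (d + k_d).

    k_m and k_d are small positive constants that keep the product non-zero;
    the defaults here are placeholders, not values from the paper.
    """
    return (m + k_m) * (d + k_d)


def activity_intensity(m: float, d: float, mu: float, theta: float,
                       k_m: float = 0.01, k_d: float = 0.01) -> float:
    """Eq. (3): Gamma = A * (mu + 1) * (theta + 1)."""
    return global_motion(m, d, k_m, k_d) * (mu + 1.0) * (theta + 1.0)


# Made-up feature values for a calm shot and a busy shot.
calm = activity_intensity(m=0.2, d=0.1, mu=0.05, theta=0.3)
busy = activity_intensity(m=3.0, d=2.0, mu=1.5, theta=4.0)
```

The "+ 1" terms mean that a shot with no local motion or no audio still keeps the contribution of its global motion rather than collapsing to zero.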

3.2 Face Detection 

Conversational scenes generally have shots with at least two humans. We utilize this
cue and detect human faces in the video using the method proposed by Viola et al.
[6]. We have found that [6] performs reasonably well for faces with different scales
in the video. The shots containing faces are clustered together based on a 24-bin RGB
histogram. Figure 3 shows human faces in some shots.








Fig. 3. Results of face detection in a scene. 



3.3 Finite State Machines (FSM) 

A Finite State Machine is defined as

    A = (Q, \Sigma, \sigma, q_0, F),    (4)

where Q is the set of states in the FSM and \sigma is the set of transitions. \Sigma contains the
conditions for the transitions, q_0 is the initial state, and F is the set of accepting (final)
states. In feature films, scenes are generally composed in accordance with the conventional
film grammar. We have observed the following characteristics for three different categories
of scenes:

- (i) Conversational scenes: low activity intensity, medium audio energy and multiple
speakers.

- (ii) Suspense scenes: a long period of silence followed by a sudden eruption either
in sound track or in activity intensity or both.

- (iii) Action scenes: intensive action activity for a certain number of shots.

We discuss three different FSMs which detect conversational, suspense and action scenes.

Fig. 4. Finite state machine for conversation scene detection. It consists of six states.

FSM for Conversation Scenes. Figure 4 shows a deterministic Finite State Machine
for detecting conversation scenes. The FSM consists of six states: Start, Primary
Speaker, Secondary Speaker, Others, Reject and Accept. Shots with high similarity
that contain a face are clustered together. The state Primary Speaker is represented
by the largest cluster, and the Secondary Speaker is represented by the
second largest cluster. The transitions are determined based on the feature values
of the shots in the scene. If the state Accept is reached, the scene is declared a
“Conversation” scene. Otherwise it is declared “Non-Conversation”. In this FSM,
Q = {Start, Primary Speaker, Secondary Speaker, Others, Reject, Accept},
q_0 = {Start} is the initial state and F = {Accept} is the final state. The set of
transitions \sigma includes {ε, a, b, c, d, e, f, g, h, k, m, n, p, q, r, s}. The transition matrix for
\sigma is shown in Table 1. The transition conditions \Sigma are:

- a: The first shot in the scene is a facial shot with low activity intensity. Results in 
the transition to the state Primary Speaker. 

- b: The first shot in the scene is a non-facial shot with low activity intensity. Results 
in the transition to the state Others. 

- d and f: The new shot is a facial shot with low activity intensity, and it belongs to 
the largest cluster. Results in the transition to the state Primary Speaker. 

- c and k: The new shot is a facial shot with low activity intensity, and it belongs to 
the second largest cluster. Results in the transition to the state Secondary Speaker. 

- e, h and r: The new shot is a non-facial shot with low activity intensity or the new 
shot is a facial shot with low activity intensity but belongs neither to the largest 
cluster nor the second largest cluster. Results in the transition to the state Others. 

- g, n and m: The new shot has high activity intensity. Results in the transition to the 
state Reject. 

- p: The new shot is a facial shot with low activity intensity. It completes the accepting 
requirement of the FSM. Results in the transition to Accept. 

- q: For any new shot, this transition loops at the state Accept. 

- s: For any new shot, this transition loops at the state Reject.

Table 1. Transition matrix for conversation detection. Columns represent “From” states, rows
represent “To” states and indicates no transition from one state to another.

FSM for Suspense Scenes. We have observed that suspense scenes often have the
following pattern. In the beginning, the scene is relatively silent and is followed by a
sudden increase in sound energy. In many cases, it is also accompanied by abrupt camera
and actor movements. Based on these observations, the FSM for detecting suspense
scenes has the following four states: Start, Wait, Reject and Accept. The state Wait
represents the pre-action moments. After a period of waiting, the state is transferred to
Accept if a sudden action shot is seen. The FSM rejects the scenes in which the sudden
action happens before the predefined interval.

Similarly, the definition of the FSM for the classification of suspense scenes can
be written in the general formula for Finite State Machines. In this case, Q =
{Start, Wait, Reject, Accept} are the states. The initial state is q_0 = {Start}, and
the final state is F = {Accept}. The transition set \sigma includes {ε, a, b, c, d, e, f, h}. The
FSM is shown in Figure 5, and the corresponding transition matrix is shown in Table 2.
The transition conditions are defined as follows:

- a: The first shot in the scene is a shot with low activity intensity. Results in the 
transition to the state Wait. 

- b: The first shot in the scene is a shot with high activity intensity. Results in the 
transition to the state Reject if the action happens before a predefined time interval. 

- c: The new shot is a shot with high activity intensity, and the waiting time is less 
than the required period. Results in a transition to the state Reject. 

- d: The new shot is a shot with low activity intensity, and the waiting time is less 
than the required period. Loops at the state Wait. 

- e: For any new shot, loops at the state Reject. 

- g: The new shot is a shot with high activity intensity, and the waiting time is more 
than the required period. Results in the transition to the state Accept. 

- h: For any new shot, loops at the state Accept.
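
The suspense FSM described by the transitions above can be sketched in a few lines. This is an illustrative sketch under several assumptions that the text does not confirm: the waiting period is counted in shots rather than seconds, "high" activity means exceeding a single threshold, and a low-activity shot arriving after the required waiting period keeps the machine in Wait. The names `classify_suspense`, `threshold` and `min_wait` are hypothetical.

```python
from typing import Sequence

START, WAIT, REJECT, ACCEPT = "Start", "Wait", "Reject", "Accept"


def classify_suspense(activity: Sequence[float], threshold: float, min_wait: int) -> bool:
    """Run the four-state suspense FSM over per-shot activity intensities."""
    state = START
    waited = 0  # number of low-activity shots seen so far (assumed unit: shots)
    for value in activity:
        if state in (ACCEPT, REJECT):
            break  # transitions h and e: both states loop on any new shot
        high = value > threshold
        if state == START:
            if high:
                state = REJECT   # transition b: action before any waiting
            else:
                state = WAIT     # transition a
                waited = 1
        else:  # state == WAIT
            if not high:
                waited += 1      # transition d: keep waiting
            elif waited >= min_wait:
                state = ACCEPT   # transition g: sudden action after waiting
            else:
                state = REJECT   # transition c: action came too early
    return state == ACCEPT


is_suspense = classify_suspense([0.1, 0.2, 0.1, 0.9], threshold=0.5, min_wait=3)
```

A scene that never leaves Wait ends in a non-accepting state and is therefore classified as non-suspense, which matches the requirement that a sudden eruption must follow the quiet period.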



FSM for Action Scenes. Action scenes in movies generally have very high action
intensity, such as scenes containing explosions, chasing and fighting. To classify a scene
as an action scene, the scene must have three or more shots with action intensity higher
than a threshold. The FSM for detecting action scenes is shown in Figure 6.

Fig. 5. Finite state machine for suspense scene detection. It consists of four states.

Fig. 6. Finite state machine for action scene detection. It consists of five states.

Table 2. Transition matrix for suspense scene detection. Columns represent “From” states, rows
represent “To” states and indicates no transition from one state to another.

For the action FSM, the state set Q is {Start, First Action, Second Action, Non-Action,
Accept (Third Action)}, where the initial state q_0 is {Start}, and
the final state F is {Accept (Third Action)}. The transition set \sigma includes
{ε, a, b, c, d, e, f, g, h, k, m, n}. A Previous State attribute is kept for a state q_i in the FSM
and is defined as the “from” state of the immediate transition before reaching state q_i. This
is used for the determination of the outgoing transitions from state Non-Action. The
transition matrix is shown in Table 3. The transition conditions are:

- a: The first shot in the scene has high activity intensity. Results in the transition to
the First Action state.

- b: The first shot in the scene has low activity intensity. Results in the transition to
the state Non-Action. The Previous State is set to Start.

- c: The new shot has high activity intensity. Results in the transition to Second Action. 

- d: The new shot has high activity intensity. Results in the transition to the state 
Accept (Third Action). 

- e: For any new shot, loops at Accept (Third Action) . 

- f: The new shot has low activity intensity. Results in the transition to Non-Action. 
The Previous State is set to First Action. 

- g: The new shot has high activity intensity. Results in the transition to the state First 
Action. The Previous State is set to Start. 

- h: The new shot has low activity intensity. Results in the transition to the state 
Non-Action. The Previous State is set to Second Action. 

- k: The new shot has high activity intensity. Results in the transition to the state 
Second Action. The Previous State is First Action. 

- m: The new shot has high activity intensity. Results in the transition to the state 
Accept (Third Action). The Previous State is Second Action. 

- n: The new shot has low activity intensity. Loops at Non-Action. 
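
The action FSM and its Previous State attribute can be sketched as below. The sketch follows the transition list above, assuming that "high" activity means exceeding a single threshold; the function and variable names are hypothetical, and the letters in the comments refer to the transitions listed above.

```python
from typing import Sequence

# Next state reached on a high-activity shot, keyed by the last "counting" state.
ADVANCE = {
    "Start": "First Action",
    "First Action": "Second Action",
    "Second Action": "Accept (Third Action)",
}
ACCEPT_STATE = "Accept (Third Action)"


def classify_action(activity: Sequence[float], threshold: float) -> bool:
    """Accept a scene once three high-activity shots have been seen."""
    state = "Start"
    previous = "Start"  # Previous State attribute, used when leaving Non-Action
    for value in activity:
        if state == ACCEPT_STATE:
            break  # transition e: loops at Accept on any new shot
        high = value > threshold
        if state == "Non-Action":
            if high:
                state = ADVANCE[previous]  # transitions g, k, m
            # otherwise transition n: loop at Non-Action
        elif high:
            state = ADVANCE[state]         # transitions a, c, d
        else:
            previous = state               # transitions b, f, h
            state = "Non-Action"
    return state == ACCEPT_STATE


is_action = classify_action([0.9, 0.1, 0.9, 0.1, 0.1, 0.9], threshold=0.5)
```

Because Non-Action remembers where it was entered from, low-activity shots between bursts do not reset the count, so the machine effectively accepts any scene with at least three high-activity shots.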



Table 3. Transition matrix for action scene detection. Column represent “From” states, row rep- 
resent “To” states and indicates no transition from one state to another. 
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4 Experimental Results 

We have experimented with over 60 clips using the Finite State Machines for 3 categories 
of scenes. These clips are taken from 7 Hollywood movies including “The Others”, 
“Jurasic Park III”, “Terminator II”, “Gone in 60 Seconds”, “Mission Impossible II”, 
“Dr. No”, and “Scream”. We also included a TV talk show, “Larry King Live” and a TV 
news program, “CNN Headlines”. The feature movies cover a variety of genres such as 
horror, drama, and action. Each clip contains approximately 20-30 shots. Four human 
observers were asked to choose the most suitable label from three categories for each 
clip. Each clip was given a ground truth label with the category that the most human 
observers agreed upon. Thus, each clip is considered as a positive member of the category 
to which it is assigned. Observers were also asked to provide the most unlikely category 
for each clip. We used this information to label a clip as a non-member (or a negative 
member) for the unlikely categories. 
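The labelling procedure above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the text does not say how ties between observers were resolved or how the "unlikely" votes were aggregated, so both choices below are assumptions.

```python
from collections import Counter


def ground_truth_label(votes):
    """Return the category chosen by the most observers.

    Assumption: a tie between the top categories yields None, since the
    text does not state how ties were handled.
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def negative_labels(unlikely_votes):
    """Assumption: a clip is a non-member of every category that at least
    one observer marked as the most unlikely."""
    return set(unlikely_votes)


# Four observers label one clip (hypothetical votes).
print(ground_truth_label(["action", "action", "suspense", "action"]))
print(negative_labels(["conversation", "conversation", "suspense", "conversation"]))
```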
To evaluate the performance of the proposed approach, two measures of accuracy
were computed, precision and recall, defined as follows:

$$P_{pos} = \frac{M_{pos}}{D_{pos}}, \quad R_{pos} = \frac{M_{pos}}{G_{pos}}, \quad P_{neg} = \frac{M_{neg}}{D_{neg}}, \quad R_{neg} = \frac{M_{neg}}{G_{neg}}$$

where $P_{pos}$, $R_{pos}$, $P_{neg}$ and $R_{neg}$ are the precision and recall for positive and negative
member detection. $G_{pos}$ and $G_{neg}$ are the numbers of ground truth positive and negative
members. $D_{pos}$ and $D_{neg}$ are the numbers of detected positive and negative members.
$M_{pos}$ and $M_{neg}$ are the numbers of the correctly matched positive and negative members.
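As a small worked example of these two measures, the sketch below computes precision as matched/detected and recall as matched/ground truth. The function name and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def precision_recall(num_matched, num_detected, num_ground_truth):
    """Precision = matched / detected, recall = matched / ground truth.

    Returns 0.0 for a measure whose denominator is zero (an assumption
    made here to keep the function total).
    """
    precision = num_matched / num_detected if num_detected else 0.0
    recall = num_matched / num_ground_truth if num_ground_truth else 0.0
    return precision, recall


# Hypothetical counts: 23 clips detected as members, 20 of them correct,
# out of 21 true members.
p, r = precision_recall(num_matched=20, num_detected=23, num_ground_truth=21)
print(f"precision = {100 * p:.1f}%, recall = {100 * r:.1f}%")
```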

There were 27 conversational scenes in the data set. The results achieved were 96.2% 
precision and 92.6% recall. For the other 25 non-conversational scenes, the precision 
was 92.0%, and the recall was 95.9%. The number of positive members of the suspense
category in the data set was 12, with 15 non-member scenes. The precision and recall
for member detection were 100.0% and 93.8% respectively, and the precision and
recall for non-member detection were 91.7% and 100.0% respectively. In action scenes, we
had 21 member clips and 29 non-member clips. The precision and recall for the positive
members were 87.0% and 95.2% respectively. The precision and recall for the negative
members were 96.3% and 89.7% respectively. The overall performance is summarized
in Table 4. These results clearly demonstrate that a finite state machine can detect and 
classify video scenes into categories. Figure 7 shows some clips with the key frames of 
the shots in the scene. 




Fig. 7. Three testing clips: (a) Conversation Scene: 007 - Dr. No; (b) Suspense Scene: Scream;
(c) Action Scene: Terminator II. The six representative key frames are displayed.



Table 4. Precision and recall for conversation, suspense and action scene classification.

Scene Type   Conversation           Suspense               Action
Accuracy     Positive   Negative    Positive   Negative    Positive   Negative
Precision    96.2%      92.0%       100.0%     93.8%       87.0%      95.2%
Recall       92.6%      95.9%       91.7%      100.0%      96.3%      89.7%

5 Conclusions

In this paper, we presented a novel framework for classifying video scenes into high-level
semantic categories using deterministic Finite State Machines (FSMs). The transitions in
each FSM are based on the low and mid-level shot features. These features are robust
and easily computable. We also incorporated face detection to cluster shots and used
these clusters to determine the transitions of the FSMs. We demonstrated the usefulness
of FSMs for this task by experimenting on over 60 movie clips and achieved high recall
and precision. In the future, we plan to explore Finite State Machines for detecting scene
categories in entire movies.
Abstract. We describe progress in the automatic detection and identification
of humans in video, given a minimal number of labelled faces as
training data. This is an extremely challenging problem due to the many 
sources of variation in a person’s imaged appearance: pose variation, 
scale, illumination, expression, partial occlusion, motion blur, etc. 

The method we have developed combines approaches from computer vision,
for detection and pose estimation, with those from machine learning
for classification. We show that the identity of a target face can be determined
by first proposing faces with similar pose, and then classifying the
target face as one of the proposed faces or not. Faces at poses differing
from those of the training data are rendered using a coarse 3-D model 
with multiple texture maps. Furthermore, the texture maps of the model 
can be automatically updated as new poses and expressions are detected. 
We demonstrate results of detecting three characters in a TV situation 
comedy. 



1 Introduction 

The objective of this paper is to annotate video with the identities, location
within the frame, and pose, of specific people. This requires both detection and
recognition of the individuals. Our motivation for this is twofold: firstly, we want
to annotate video material, such as situation comedies and feature films, with
the principal characters as a first step towards producing a visual description of
shots suitable for blind people, e.g. “character A looks at character B and moves
towards him”. Secondly, we want to add index keys to each frame/shot so that
the video is searchable. This enables new functionality such as “intelligent fast
forwards”, where the video can be set to play only shots containing a specific
character; and character-based search, where shots containing a set of characters
(or not containing certain characters) can easily be obtained.

The methods we are developing are suitable for any video material, including
news footage and home videos, but here we present results on detecting
characters in an episode of the BBC situation comedy ‘Fawlty Towers’. Since 
some shots are close-ups or contain only face and upper body, we concentrate 
on detecting and recognizing the face rather than the whole body. 

The task is a staggeringly difficult one. We must cope with large changes in
scale (faces vary in size from 200 pixels to as little as 15 pixels, i.e. very low
resolution), partial occlusion, varying lighting, poor image quality, and motion
blur. In a typical episode the face of a principal character (Basil) appears frontal
in one third of the frames, in profile in one third, and from behind in the other
third, so we have to deal with a much greater range of pose than is usual in face
detection.

Previous approaches to character identification have concentrated on frontal
faces [7,9]. This is for two reasons: (i) face detection is now quite mature and
successful for frontal faces [10,15,16] (both in terms of false positive/false negative
performance, and also in efficiency); and (ii) because most recognition methods
are developed for frontal faces [17]. For example, image-based ‘eigenface’ or
‘Fisherface’ [2] approaches are successful for registered frontal faces with stable 
illumination. Detection of profile faces [15] or arbitrary pose [10,12] has not yet 
reached the same level of performance. This is principally because in the case of 
frontal faces pattern matching methods can be used to classify an image region 
through a fixed mask as a face or non-face, since there are sufficient distinctive 
internal features visible (eyes, mouth, etc.). In the case of profiles there are fewer 
distinctive features, and the silhouette varies. Consequently, simple fixed regions 
of interest include background, and the resulting learning problem is then much 
more difficult. 

The approach we have developed is closest in spirit to the pose and multiple 
view based approaches of [3,5,13]. Suppose that we have identified a face region 
in a target frame, and our task is now to decide if this is the face of one of the 
characters in our training data. This is a matching problem, and in the case of 
faces we must account for three principal ‘dimensions’ of variation: pose change, 
illumination change, and expression change. Conceptually we divide this problem 
into two parts: 

1. Pose based rendering: a set of candidate faces is proposed by rendering faces 
from the training data at the same pose as the target face, see figure 4. The 
candidate faces will typically contain several examples of the correct face with 
a range of expressions, as well as examples of other characters. This largely 
eliminates the pose variation, and we have reduced the problem to matching 
over expression and illumination change. 

2. Classification: a matching decision is made amongst the proposed faces. The 
outcome is a match with one of the faces, or a non-match (if the target face 
is not one of the learnt characters). This requires a matching measure which 
is tolerant to small changes of expression, and largely invariant to illumination 
conditions. 



2 Approach 

In this section we describe the two stages of the algorithm: learning face models, 
and recognition of faces in target frames. The overall recognition approach consists
of three steps: (i) detecting candidate face regions in the target frame, (ii)
determining the pose of the target face and proposing candidate faces at that 
pose, and (iii) classification. 




Automated Person Identification in Video 







a) Input image b) Skin probability c) Candidate face regions 

Fig. 1. Candidate face region detection using skin colour model and multi-scale blob 
detector. Darker grey levels in (b) represent higher probability. Concentric circles in (c) 
show the scale uncertainty in the detections. Note there are several false positives due 
to non-face skin regions, and non-skin regions of similar colour. These false positives 
will be removed by subsequent verification. 

2.1 Candidate Face Region Detection 

The first step in detection is to propose candidate face regions in an image for 
further processing. Requirements are that the algorithm proposes all faces in 
the image as candidates across a wide range of scale and pose. We desire to 
have a relatively small number of false positive (non-face) responses from the 
algorithm, since processing false detections incurs computational cost, but we 
can cope with some false positives since candidate regions will be subsequently 
verified. This differs somewhat from the isolated problem of face detection [15, 
16], where detections are not subject to additional verification. 

We take advantage of working with colour video and use a skin colour detector
to propose probable face regions. The probability distribution over the
colour of skin pixels in RGB space is modelled as a single Gaussian with full
covariance. A corresponding Gaussian distribution with large variance is estimated
for ‘background’ pixels, and Bayes' theorem is applied to obtain an image of the
posterior probability that each pixel is skin. Skin blob detection is performed
over an image pyramid by applying a Difference of Gaussians (DoG) operator
[14] to the skin probability image at each level. A face region is declared at local
maxima in the DoG response with positive response above threshold, and
corresponding high skin probability. The approximate scale of the face is obtained
from the pyramid level. Figure 1 shows an example image, skin probability, and
detected candidate face regions.
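The skin posterior described above can be sketched with NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the prior `prior_skin`, and all numeric model parameters in the usage lines are assumptions, and the DoG blob detection step is not shown.

```python
import numpy as np


def gaussian_pdf(pixels, mean, cov):
    """Evaluate a full-covariance Gaussian density at each row of `pixels` (N x 3)."""
    diff = pixels - mean
    inv_cov = np.linalg.inv(cov)
    mahal = np.einsum("ni,ij,nj->n", diff, inv_cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** mean.size * np.linalg.det(cov))
    return np.exp(-0.5 * mahal) / norm


def skin_posterior(image, skin_mean, skin_cov, bg_mean, bg_cov, prior_skin=0.5):
    """Return P(skin | colour) for every pixel of an H x W x 3 image via Bayes' theorem."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    p_skin = gaussian_pdf(pixels, skin_mean, skin_cov) * prior_skin
    p_bg = gaussian_pdf(pixels, bg_mean, bg_cov) * (1.0 - prior_skin)
    # Tiny constant guards against 0/0 when both likelihoods underflow.
    return (p_skin / (p_skin + p_bg + 1e-300)).reshape(h, w)


# Hypothetical model parameters, for illustration only.
skin_mean = np.array([200.0, 140.0, 120.0])
skin_cov = np.diag([100.0, 100.0, 100.0])
bg_mean = np.array([128.0, 128.0, 128.0])
bg_cov = np.diag([5000.0, 5000.0, 5000.0])  # large variance for 'background'

image = np.array([[[200, 140, 120], [20, 200, 30]]], dtype=float)
print(skin_posterior(image, skin_mean, skin_cov, bg_mean, bg_cov))
```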

2.2 Pose Based Face Rendering 

We require a method of rendering faces at poses different from those in the 
training material. The approach used here is to combine coarse 3-D geometry 
with multiple texture maps. The model has two parts: a global 3-D geometric 
model of the head, and a set of visual ‘aspects’ which define appearance over 
local regions of pose space. The shape of the head is modelled simply as an 
ellipsoid, the parameters of which are fitted to a single training image of the 
person. Figure 2a shows a training image for the ‘Basil’ model, and Figure 2b
the ellipsoid model overlaid.

a) Training image b) Ellipsoid c) Texture d) Novel view

Fig. 2. Ellipsoid head model. The triangulation shown (a) is coarser than that used,
to aid visibility. The blank area of the texture map is the back of the head, which has
not yet been observed.

The aims of using a 3-D model for the head are twofold:



1. Extrapolation: The 3-D model allows us to extrapolate some way from a 
single view of the person and propose how the person looks in nearby poses. 
The single training image is back-projected onto the ellipsoid to give a texture
map (Figure 2c); a new view of the head in a different pose can then
be rendered by transforming the ellipsoid and projecting the texture map back
into the image. Figure 2d shows an example: for poses near to the one from 
which the texture map was obtained, fairly accurate images can be rendered. 
Because the ellipsoid geometry only approximates the head shape, the realism 
of the rendered views degrades as the pose change increases, principally because 
the ellipsoid does not predict self occlusions (such as the eye being occluded as 
the face looks down). However, it will be seen that combining a simple shape 
model with multiple texture maps enables accurate rendering of many poses. By 
contrast, an accurate 3-D model could extrapolate further from a single view, 
but it is difficult to obtain such an accurate model, and an inaccurate but
non-smooth model can introduce many artifacts that we wish to avoid. Ellipsoids [1]
and close relatives (superquadrics [11], tapered ellipsoids [13]) have been applied 
successfully to head tracking by several authors. 

2. Pose space: The second reason for the 3-D model is that it provides a global 
reference frame against which any image of the face can be aligned. Initially, 
having seen just a single image of the face, we have a good idea of the appearance 
in only a narrow range of poses, and with fixed facial expression. Estimating the 
pose of a new image and verifying the identity of the person allows a new image to 
be classified as: (i) close in pose and appearance to an already seen image, (ii) in 
a pose far from one observed up to this point, or (iii) in a known pose but with 
differing appearance (facial expression). In the latter two cases the algorithm 
considers expanding the model by adding additional texture maps, positioning 
them appropriately in pose space. This allows the model to be improved without 
manual supervision. 

2.3 Pose Estimation 

Given a candidate face region in the image, the pose of the face is recovered by
search in the joint pose/appearance space, proposing the appearance of the face
and comparing against the target image. The pose is parameterized as a 6-D vector
$\mathbf{p} = (\theta, \phi, \psi, \sigma, \tau_x, \tau_y)$ corresponding to rotation, scale, and 2-D translation
in the image. Rotation is specified by azimuth $\theta$, elevation $\phi$ and in-plane
rotation $\psi$. This parameterization allows reasonable bounds to be specified easily.
A candidate face region provides an initial estimate of scale $\sigma$, up to the scale
step between pyramid levels, and translation $(\tau_x, \tau_y)$ (the centre of the candidate
region). The task is to find the pose parameters $\mathbf{p}$ which maximize the similarity
between the rendered view $R(\mathbf{p}, \mu)$ and the target image $I$. Normalized
cross-correlation (NCC), masked by the silhouette of the rendered view, is used as the
similarity measure:

$$\hat{\mathbf{p}} = \arg\max_{\mathbf{p}} \max_{\mu \in \{\mu_p\}} \mathrm{NCC}\left(I, R(\mathbf{p}, \mu)\right) \qquad (1)$$

Fig. 3. Pose estimation (best viewed in colour). Top rows show original image, middle
rows show ellipsoid overlaid at the estimated pose, bottom rows show overlaid model
rendered at the estimated pose.
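The similarity term in Equation (1) can be illustrated with a short NumPy sketch. It shows only the masked NCC and the inner maximization over a list of already rendered candidates; rendering the ellipsoid and the search over pose are not shown. The function names are hypothetical.

```python
import numpy as np


def masked_ncc(target, rendered, mask):
    """Normalized cross-correlation computed only over pixels where `mask` is True."""
    a = target[mask].astype(float)
    b = rendered[mask].astype(float)
    a -= a.mean()
    b -= b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0


def best_candidate(target, candidates):
    """Return (index, score) of the (rendered, mask) pair with the highest masked NCC."""
    scores = [masked_ncc(target, rendered, mask) for rendered, mask in candidates]
    index = int(np.argmax(scores))
    return index, scores[index]


# Toy data: the second candidate is an affine brightness change of the target.
target = np.arange(16, dtype=float).reshape(4, 4)
mask = np.ones((4, 4), dtype=bool)
mask[0, 0] = False  # pixel outside the silhouette is ignored
candidates = [(-target, mask), (2.0 * target + 5.0, mask)]
print(best_candidate(target, candidates))
```

Because NCC subtracts the mean and divides by the norm, it is unchanged by a gain and offset applied to the rendered view, which is why the second toy candidate scores highest.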



For a given pose, multiple appearances $R(\mathbf{p}, \mu)$ are proposed by selecting a
subset of the texture maps $\{\mu_p\}$ which are (i) close to the current pose, and (ii)
varying in expression. This is done by first finding the texture map which has
pose $\mathbf{q}$ closest to the current estimate $\mathbf{p}$, then selecting all texture maps with
pose close to $\mathbf{q}$ (which represent different facial expressions). Distance between
poses is computed by the dot product between front-facing vectors normal to
the ellipsoid, so that in-plane rotation about the frontal view does not influence
the distance. Using this ‘nearest neighbour plus siblings’ approach to selecting
texture maps allows the algorithm to consider texture maps corresponding both
to close poses and varying facial expression. Numerical optimization is carried
out using the coordinate descent algorithm of [8]. Figure 3 shows examples of
pose estimation. Additional examples can be seen in Figure 4, discussed below.

Fig. 4. Face classification based on multiple appearance proposals. The leftmost image
is the target, with rows showing proposals rendered from the Manuel, Basil and Sybil
models. The task is to decide which proposal to accept, or to reject all.
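A minimal sketch of the ‘nearest neighbour plus siblings’ selection is given below. Several details are assumptions made for illustration: the facing vector is built from azimuth and elevation only under one particular axis convention, the dot product is converted to an angle, and the `sibling_angle` threshold is a hypothetical parameter.

```python
import numpy as np


def facing_vector(azimuth, elevation):
    """Unit vector along which the face points, angles in radians (assumed convention)."""
    return np.array([
        np.sin(azimuth) * np.cos(elevation),
        np.sin(elevation),
        np.cos(azimuth) * np.cos(elevation),
    ])


def pose_angle(pose_a, pose_b):
    """Angle in radians between the facing vectors of two (azimuth, elevation) poses."""
    dot = float(np.dot(facing_vector(*pose_a), facing_vector(*pose_b)))
    return float(np.arccos(np.clip(dot, -1.0, 1.0)))


def select_texture_maps(current_pose, map_poses, sibling_angle):
    """Indices of the nearest texture map and of all maps within `sibling_angle` of it."""
    nearest = min(range(len(map_poses)),
                  key=lambda i: pose_angle(current_pose, map_poses[i]))
    q = map_poses[nearest]
    return [i for i, pose in enumerate(map_poses)
            if pose_angle(q, pose) <= sibling_angle]


# Two texture maps near frontal (e.g. different expressions) and one far in azimuth.
map_poses = [(0.0, 0.0), (0.02, 0.0), (1.0, 0.0)]
print(select_texture_maps((0.1, 0.0), map_poses, sibling_angle=0.05))
```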



2.4 Classification 

Given an estimated pose, a set of images is proposed by the models of each 
person. Figure 4 shows an example, with each person model attempting to reproduce
the leftmost image, of Basil. Note that the proposals here have the same
pose but vary in facial expression. The aim now is to obtain a representation of 
the face image suitable for person classification, capturing the essential structure 
of the facial appearance but allowing for small local misalignments between the 
original and rendered images due to factors such as the approximation of the 
face shape as ellipsoidal. Using this representation, one of the proposed images 
may be accepted as a match, yielding classification of the person, or all may be 
discounted, in the case of a non-face region, or person other than those modelled. 

Use of ‘edges’ rather than raw grey levels for emphasizing salient image structure
has been proposed in many contexts [14] and an edge-based descriptor is
used here, proposed most recently for comparing optical flow fields [6]. For an
image $I$, the image gradients $I_x$, $I_y$ are computed, and half-wave rectified to form
four non-negative channels $I_x^+$, $I_x^-$, $I_y^+$, $I_y^-$. Each channel is then blurred with a
Gaussian to give some robustness to local image deformations, and the descriptor
for the image $D(I)$ is formed by normalizing and concatenating the four
channels. The non-negativity and relative sparseness of signal in each channel
allows the channels to be blurred without destroying orientation information or
edges by cancelling positive and negative gradients. The width of the Gaussian
is set proportional to the scale of the face in the image.

Fig. 5. Gradient descriptor for target (top) and rendered (bottom) images. Darker grey
levels represent larger values. The similarity measure here is 0.98, a close match.

Fig. 6. Model update by tracking. A colour tracker successfully tracks the face over
large pose variation and is used to validate proposed updates to the model.

When comparing descriptors for a target image $I$ and a rendered view of the
ellipsoid $R(\mathbf{p}, \mu)$, the rendered view is overlaid on the target image (in the manner
of Figure 4) before computing image gradients in order to avoid introducing
spurious edges due to the ellipsoid boundary. Similarity between the corresponding
descriptors $D(I)$ and $D(R(\mathbf{p}, \mu))$ is obtained by correlation, considering only
pixels within the ellipsoid mask. Figure 5 shows example descriptors for target
and rendered images.
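The descriptor and its comparison can be sketched in NumPy as below. This is an illustration under stated assumptions: each channel is normalized separately by its L2 norm, the similarity is an uncentred correlation (cosine) over masked pixels, and `sigma` is passed in directly rather than derived from the face scale. None of these choices is specified in the text, and the function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at about three standard deviations."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()


def blur(channel, sigma):
    """Separable Gaussian blur; the image must be larger than the kernel."""
    k = gaussian_kernel(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, channel)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)


def gradient_descriptor(image, sigma):
    """Four half-wave rectified, blurred, normalized gradient channels (4 x H x W)."""
    gy, gx = np.gradient(image.astype(float))
    channels = [np.maximum(gx, 0), np.maximum(-gx, 0),
                np.maximum(gy, 0), np.maximum(-gy, 0)]
    blurred = [blur(c, sigma) for c in channels]
    return np.stack([b / (np.linalg.norm(b) + 1e-12) for b in blurred])


def descriptor_similarity(desc_a, desc_b, mask):
    """Uncentred correlation of two descriptors over pixels where `mask` is True."""
    a = desc_a[:, mask].ravel()
    b = desc_b[:, mask].ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


image = np.random.default_rng(0).random((16, 16))
mask = np.ones((16, 16), dtype=bool)
d = gradient_descriptor(image, sigma=1.0)
print(d.shape, descriptor_similarity(d, d, mask))
```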

2.5 Model Learning 

The supervision required for learning the face model is minimal: a face for each 
character is identified in one frame, and the ellipsoid model fitted. Additional 
training is automatic, as will now be described. 

Having computed the similarity (section 2.4) between a set of face candidates 
and a particular person, a decision is made as to which detections to add to the 
model as new texture maps, enabling the model to cope with wider variations 
in expression and pose. 

A low threshold on similarity $t_l$ is defined, above which we are confident
that a detection matches a particular person. Three cases then follow: (i) if the
similarity $t$ of a match is above a second, higher threshold ($t > t_h > t_l$) and the
pose is close to one already seen, then the image need not be added to the
model; (ii) if however the match is certain ($t > t_h$) and the pose is far from one
already seen, the image is added to the model so that the range of pose covered
is expanded; finally, (iii) less certain matches ($t_l < t < t_h$) which lie close to an
existing pose are validated by tracking. These would typically represent unseen
facial expressions. To validate such matches, temporal coherence of the video
is exploited: a tracker is run from frames with certain matches, ending at the
candidate frame. The tracker used is a colour version of a deformable region
tracker [4]. If the position of the tracked region agrees with the detected face,
then the model is updated. Figure 6 shows an example of successful tracking
over wide pose variation.

Fig. 7. ROC curves for three characters in 1,500 key frames. Successful identification
requires correct detection, pose estimation, and recognition. In all cases, unsupervised
model update improves the accuracy of the model.
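The three cases can be written as a small decision function. The sketch below is illustrative only: the function name, the returned labels and the threshold values in the usage lines are hypothetical, and the text does not say what happens to a less certain match at a pose far from any seen pose, so that branch is labelled as unspecified.

```python
def update_decision(similarity, pose_is_close, t_low, t_high):
    """Decide what to do with a detection, following cases (i) to (iii).

    `pose_is_close` states whether the estimated pose is close to a pose
    already covered by a texture map.
    """
    if similarity <= t_low:
        return "no_match"  # not confident that the detection is this person
    if similarity > t_high:
        # Certain match: (i) nothing new to learn, or (ii) expand pose coverage.
        return "skip" if pose_is_close else "add_texture_map"
    if pose_is_close:
        return "validate_by_tracking"  # case (iii), likely a new expression
    return "unspecified"  # assumption: the text does not cover this combination


# Hypothetical thresholds, for illustration only.
for sim, close in [(0.95, True), (0.95, False), (0.80, True), (0.50, True)]:
    print(sim, close, update_decision(sim, close, t_low=0.7, t_high=0.9))
```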

3 Experimental Results 

The algorithm was tested on 1,500 key-frames taken one per second from the
episode ‘A Touch of Class’ of the sitcom ‘Fawlty Towers’. We evaluated detection
of three of the main characters: Basil, Sybil and Manuel. The task was to
detect the frames containing each character, and identify the image position and
pose of the face correctly. Correctness was measured by the distance to ground
truth points marked on the eyes, nose and ears according to pose, requiring
distance of all predicted points to be less than 0.3 of the inter-ocular distance.
Corresponding points for the model (for testing purposes only) were obtained
by back-projecting the ground truth points onto the ellipsoid during training
and model update. Pose of the ground truth faces in the video covers poses of
around +/-60° azimuth, +/-30° elevation and +/-45° in-plane rotation. Faces
vary in scale from 15 to 200 pixels. The values of the thresholds $t_l$ and $t_h$ were
determined from a validation set and kept fixed throughout the experiments.
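The correctness criterion can be expressed compactly. In this sketch the inter-ocular distance is passed in as a number, since the text does not say how it was obtained for poses in which both eyes are not visible; the function name and the point coordinates are hypothetical.

```python
import numpy as np


def detection_is_correct(predicted, ground_truth, inter_ocular, tolerance=0.3):
    """True if every predicted point lies within `tolerance` x inter-ocular
    distance of its ground truth point. Points are N x 2 image coordinates."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    errors = np.linalg.norm(predicted - ground_truth, axis=1)
    return bool(np.all(errors < tolerance * inter_ocular))


# Hypothetical points (e.g. two eyes and the nose), in pixels.
truth = [[10.0, 10.0], [20.0, 10.0], [15.0, 16.0]]
pred = [[10.5, 10.0], [20.0, 11.0], [15.0, 17.0]]
print(detection_is_correct(pred, truth, inter_ocular=10.0))
```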

Figure 7 shows ROC curves for each of the three characters. Note that we 
treat the problem as one of detection rather than 1-of-m classification since we 
do not know a priori all the characters in the video. For each character, curves 
are shown for the initial model and two runs of the model update procedure. 
The number of texture maps after model update varied for each character, due 
to the varying number of frames in which the character appears and differences 
in pose variation between characters, and is shown in the legend. The graphs 
show clear improvement in the accuracy of the model after update, for example 
in the Basil model the equal error rate decreases from 30% to 15% after two 
rounds of update. At this stage, characters can be detected in 75-95% of frames 
at a false positive rate of 10%. These results are extremely promising given the 
difficulty of the task. It is interesting to observe that the performance on Sybil 
is notably better than the other characters; this is the ‘moustache problem’ -
the moustache is a strong visual feature shared between Basil and Manuel, and 
indeed three other secondary characters in the episode, which gives much scope 
for confusion. 
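For readers unfamiliar with the equal error rate quoted above, the following sketch computes ROC points and an approximate EER from detection scores. It is a generic illustration, not the evaluation code used here: the scores and labels are made up, tied scores are not treated specially, and the EER is approximated at the sampled threshold where the false positive rate is closest to the miss rate.

```python
import numpy as np


def roc_points(scores, labels):
    """False positive and true positive rates as the score threshold is lowered."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores)  # highest score first
    tpr = np.cumsum(labels[order]) / labels.sum()
    fpr = np.cumsum(~labels[order]) / (~labels).sum()
    return fpr, tpr


def equal_error_rate(fpr, tpr):
    """Approximate EER: the point where FPR is closest to the miss rate 1 - TPR."""
    miss = 1.0 - tpr
    i = int(np.argmin(np.abs(fpr - miss)))
    return float((fpr[i] + miss[i]) / 2.0)


# Made-up scores; label 1 means the character is truly present.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
fpr, tpr = roc_points(scores, labels)
print(equal_error_rate(fpr, tpr))
```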

4 Discussion 

We have presented methods for detecting and identifying characters in video 
across wide variations in pose and appearance by combining a simple 3-D model 
with view-dependent texture mapping. Placing the views of the face in a common 
reference frame allows more efficient search than possible with an unorganized 
collection of images, and provides a basis for automatic model update. Use of a 
simple 3-D model rather than a detailed face model [3] avoids introducing severe 
rendering artifacts due to incorrect modelling of self-occlusion, and multiple 
texture maps allow facial expressions to be modelled, which is challenging for 
3-D models with a fixed texture map. 



Acknowledgements. Thanks to EC Project CogViSys for funding. 
Abstract. A new method allowing for semantically guided image editing and
synthesis is introduced. The editing process is made considerably easier and more 
powerful with our content-aware tool. We construct a database of image regions 
annotated with a carefully chosen vocabulary and utilize recent advances in texture 
synthesis algorithms to generate new and unique image regions from this database 
of material. These new regions are then seamlessly composited into a user’s
existing photograph. The goal is to empower the end user with the ability to edit
existing photographs and synthesize new ones on a high semantic level. Plausible 
results are generated using a small prototype database and showcase some of the 
editing possibilities that such a system affords. 



1 Introduction and Related Work 

The ongoing digital media revolution has resulted in an untold amount of new digital 
content in the form of images and videos. Much of this digital content is generated by 
capturing real scenarios using cameras. We feel that digital media production would 
become much simpler, more effective, and cheaper if intuitive tools were available for 
combining existing photos to generate new images. Merging images and image segments 
together to form new images has been around for a long time in the form of image and 
video compositing. The composition process is however tedious and best carried out 
by the practiced eye of an artist. For such synthesis an artist combines various image 
regions from different sources to achieve a predefined content and context in the final 
image. We seek to simplify this process by providing a content-aware tool for generating 
new images. 

To facilitate this form of semantic image synthesis, we first create a database of 
imagery which has regions annotated with semantic labels and image characteristics. 
Then, based on the content that the user desires, the user can query this database for 
imagery that will suit the region that will be composited. The user then chooses an image 
region from the query results, which acts as a source for texture synthesis algorithms
[1,2,3]. The synthesized region is composited into the image being edited. The system
pipeline is shown in Fig. 1. In this way a user can edit a photo by synthesizing any 
number of user defined regions from an existing database of imagery. We have termed 
this semantically guided media recombination process Content Based Image Synthesis 
(CBIS). 

Fig. 1. The flow of the CBIS application. An input image is loaded and the area to replace masked.
A query against an annotated image region database is then made by a user in order to find suitable
content with which to fill the area. This source region is then used by a texture synthesis algorithm
to produce a new region which is finally composited into the input image.

At the heart of the CBIS method is a reliance on semantic annotations of regions (e.g.
sky, mountain, trees etc.) so that if the user wants to synthesize a new mountain range into
his photo he can search the database for other mountain regions. The idea of using such
high level annotations for doing search and synthesis has been applied successfully by
Arikan [4] in the domain of motion data for animation. Their vocabulary, consisting of
such verbs as walk, run, and jump, is used to annotate a database of motion data. When
the vocabulary is applied to a timeline, the system synthesizes plausible motion which
corresponds to the annotated timeline.

The analogue in the image domain is the segmentation and annotation of distinct
image regions with high level semantic tags. Zalesny [5] segments textures into subtextures
based on clique partitioning. Each subtexture is represented by a label in a spatial label
map. The label map guides their texture synthesis procedure such that the correct
subtexture appears in the correct spatial position. Hertzmann [6] also defines some notion
of a label map in his texture-by-numbers synthesis scenario. Our method borrows from 
these ideas for a label map but defines a much higher level mapping, less based on image 
statistics and more based on the semantic annotation of the region. For example, the 
user may desire one region to be synthesized as sky and another region as mountain. 
These region annotations ultimately map to the regions from the database that have been 
selected by the user as synthesis sources. 

In the next section we further motivate the semantic power of such a CBIS application 
using semiotics. In section 3 we detail our methods for segmentation and annotation of 
images. Section 4 lays out our query method and details the hybrid texture synthesis 
approach used. Finally, we show some results that have been generated with the system
in section 5 and conclude with a number of directions for future work.

2 Semiotics Basics 

Denotation and Connotation: Semiotics provides us with a rich set of abstractions and 
quasi-theoretical foundations supporting our recombinant media application. A user may 
want to composite a newly synthesized region into an existing photograph for practical 
purposes, aesthetic reasons or, from a semiotic perspective, to change the meaning of the 
photograph. Barthes [7] identifies several ways in which one can alter the connotation
of an image (e.g. composition, pose, object substitution/insertion, photogenia etc.). Our
system thus uses image composition to enable the user to alter both the denotation
(pixels) and the connotation. For example, given the photo of the Stalin statue with the
happy sky seen in Fig. 1 (input), we might want to substitute a cloudy or stormy sky
(output). The resultant change in meaning is left to the subjective interpretation of the 
reader. 

Structural Analysis: Semiotic systems can be defined structurally along two primary 
axes: syntagm and paradigm. The syntagm defines the spatial positioning of elements in 
an image, whereas the paradigm is defined by the class of substitutions that can be made 
for a given element [7,8,9]. A linguistic analogy is usually helpful for understanding. 
The syntagm of a sentence corresponds to its grammatical structure (e.g. noun-verb- 
prepositional phrase); the paradigm corresponds to the set of valid substitutions for each 
word in the sentence. For instance, a verb should be substituted with a verb in order that 
the sentence still make sense. 

Syntagm and paradigm are the structural axes along which a user may vary an image 
in the course of editing. We can consider three combinations of variation along these 
axes: (1) vary the paradigm and fix the syntagm; (2) fix the paradigm and vary the 
syntagm; or (3) vary both the paradigm and syntagm. (1) is analogous to playing with a 
Mr. Potato Head: the structure of the face remains the same, but various noses, mouths,
etc. can be substituted into that structure. (2) roughly corresponds to the method of 
texture-by-numbers in [6]. In that work, a new syntagm (a label map) is drawn by a user 
and filled in with the corresponding labelled regions from a source image. (3) defines a 
more difficult operation since changing the syntagm of an image can change the valid 
paradigmatic substitutions as well. We consider variation (1) in this work. The layout of 
the input photo remains the same (i.e. mountains stay mountains of the same shape), but 
the database provides a set of paradigmatic variations (i.e. rocky mountains can become 
snowy mountains). 



3 Database Generation 

The database creation process consists primarily of segmenting meaningful and useful 
regions in images and annotating them with the appropriate words from our annotation
vocabulary. Each of these segmented, annotated regions is then stored in an XML
structure which can later be queried by the end user. In our prototype system we have 
annotated slightly more than 100 image regions which serve as the database. 
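To make the idea of a queryable XML region store concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`database`, `region`, `annotation`, `polygon`), the file names and the query function are all hypothetical; the actual schema used by the prototype is not described in the text.

```python
import xml.etree.ElementTree as ET


def make_region(image_file, region_type, annotations, polygon):
    """Build one <region> element with its annotations and outline polygon."""
    region = ET.Element("region", {"image": image_file, "type": region_type})
    for name, value in annotations.items():
        ET.SubElement(region, "annotation", {"name": name, "value": value})
    ET.SubElement(region, "polygon").text = " ".join(f"{x},{y}" for x, y in polygon)
    return region


def query(database, region_type, required):
    """Return regions of `region_type` whose annotations include all of `required`."""
    matches = []
    for region in database.iter("region"):
        if region.get("type") != region_type:
            continue
        found = {a.get("name"): a.get("value") for a in region.iter("annotation")}
        if all(found.get(k) == v for k, v in required.items()):
            matches.append(region)
    return matches


db = ET.Element("database")
db.append(make_region("a.jpg", "Sky", {"Sky": "Stormy", "Hue": "Gray"},
                      [(0, 0), (10, 0), (10, 5)]))
db.append(make_region("b.jpg", "Sky", {"Sky": "Clear", "Hue": "Blue"},
                      [(0, 0), (4, 0), (4, 4)]))
db.append(make_region("c.jpg", "Mountains", {"Mountains": "Snowy"},
                      [(0, 0), (1, 0), (1, 1)]))

for hit in query(db, "Sky", {"Sky": "Stormy"}):
    print(ET.tostring(hit, encoding="unicode"))
```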

Region Segmentation: The first step in generating the database of images is the segmen- 
tation of meaningful regions in these images. While there has been some recent progress 
in the automatic segmentation of semantically meaningful regions of images [10,11,12], 
or even semi-automatic segmentation, currently we opt for a fully manual procedure as 
this allows for a more directed user input of higher semantic import. 

Region Annotation: The annotation vocabulary is chosen to fit the domain of natural 
landscape imagery. This domain makes sense for us since textures are prevalent and 
because a relatively small vocabulary can describe the typical regions in such a scene. 
The choice of words used to annotate quantities like hue, saturation, and lightness is 
informed by [13,14]. The vocabulary follows: 



Region: {Sky | Mountains | Trees | Vegetation | Water | Earth | Hue | Lightness | Saturation | Distance}
Sky: {Clear | Partly Cloudy | Mostly Cloudy | Cloudy | Sunset | Sunrise | Stormy}
Mountains: {Snowy | Desert | Rocky | Forested}
Trees: {Deciduous | Coniferous | Bare}
Vegetation: {Grass | Brush | Flowering}
Water: {Reflective | Calm | Rough | Ocean | Lake | Stream | River}
Earth: {Rocky | Sandy | Dirt}
Hue: {Red | Orange | Yellow | Green | Blue | Purple | Pink | Brown | Beige | Olive | Black | White | Gray}
Lightness: {Blackish | Very Dark | Dark | Medium | Light | Very Light | Whitish}
Saturation: {Grayish | Moderate | Strong | Vivid}
Distance: {Very Close | Close | Medium | Distant | Very Distant | Changing}



Region annotations are made manually using a GUI. Relying on a fully manual pro- 
cess allows us to work with higher level semantic categories. Of course, as automatic 
annotation methods get better, they can be integrated to supplement the manual proce- 
dure. Given the categories chosen above, there should be little problem with consensus 
on the appropriate annotation(s) for a given region, though in general subjectivity can 
be a problem for manual annotation. 

The hue, saturation, and lightness user annotations are augmented by fuzzy his- 
tograms of HSL pixel values. A 13 bin hue histogram, a 4 bin saturation histogram, 
and a 7 bin lightness histogram are generated based on the HSL pixel values in a given 
region. Bins are fuzzy insofar as a given pixel can contribute (bi-linearly) to adjacent 
bins. Each bin also corresponds to one of the vocabulary words for that category; for 
saturation the 4 bins correspond to <Grayish, Moderate, Strong, Vivid>. A saturation 
histogram of <.8, .2, 0, 0> therefore indicates a very grayish region. 
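The fuzzy binning can be illustrated with a short Python sketch. It is not the prototype's code: the function name `fuzzy_histogram`, the evenly spaced bin centres on [0, 1], and the linear split of each pixel's weight between its two nearest bins are assumptions made for illustration. Hue is circular and would additionally need wrap-around between the first and last bins, which this sketch does not handle.

```python
def fuzzy_histogram(values, n_bins):
    """Normalised fuzzy histogram of values in [0, 1].

    Assumption: bin centres are evenly spaced on [0, 1] and each value
    splits its unit weight linearly between the two nearest centres.
    """
    if n_bins < 2:
        raise ValueError("need at least two bins")
    hist = [0.0] * n_bins
    for v in values:
        v = min(max(v, 0.0), 1.0)       # clamp to the valid range
        pos = v * (n_bins - 1)           # position measured in bin units
        lo = min(int(pos), n_bins - 2)   # index of the lower neighbouring bin
        frac = pos - lo                  # share that goes to the upper bin
        hist[lo] += 1.0 - frac
        hist[lo + 1] += frac
    total = sum(hist)
    return [h / total for h in hist] if total else hist


SATURATION_WORDS = ["Grayish", "Moderate", "Strong", "Vivid"]
# Four weakly saturated pixels: most of the mass falls in the Grayish bin.
saturation_hist = fuzzy_histogram([0.0, 0.1, 0.05, 0.2], len(SATURATION_WORDS))
```

Because each bin corresponds to a vocabulary word, the resulting vector can be read directly against the annotation vocabulary.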

This dual representation of lower level features should also mitigate the somewhat 
subjective user annotations. Currently, we are also studying other automatic or semi- 
automatic methods for annotating entities such as the lighting direction, or camera per- 
spective. These would serve to make database query results even more pertinent to the 
image into which they will be synthesized. 



4 Image Recombination 

Query: The image recombination procedure begins with the user defining a region that 
will be replaced in his input image (e.g. a mountain range is selected). This region is 
then annotated with keywords from the vocabulary using a drag and drop GUI. This 
annotation is used to query the XML database and return the N most pertinent images 
from which the user selects a source image. Subtle decisions in lighting, perspective, and 
color are thus not made automatically and can be evaluated by the user. This maximizes 
the user’s potential to affect connotation in the image since he has good suggestions 
from the database, but ultimately has the final choice in the paradigmatic substitution. 

Matching of annotated regions proceeds as suggested by Santini [15]. This approach 
allows for binary feature sets (i.e. presence/absence of keywords) to be compared with 
fuzzy predicates. Thus we can compare keyword annotations of a region’s lower-level 
features (e.g. saturation) with the histograms of those features as calculated directly from 
pixel values. 
Let $f(R_i) = \langle f_1(R_i), f_2(R_i), \ldots, f_p(R_i) \rangle$ represent a p-dimensional feature
vector for a region $R_i$, where $f_k(R_i) = 1$ if region $R_i$ has the keyword annotation associated
with feature $k \in \{1, \ldots, p\}$. For fuzzy features such as saturation we maintain two feature
vectors with $f_k(R_i) \in [0, 1]$. One vector is based directly on the histogram of pixel values. The
other vector is based on the keyword annotation, but is also made fuzzy. As an example,
consider the two feature vectors describing saturation for $R_1$. The feature vector
calculated from pixel saturation values might be $f(R_1) = \langle 0, .1, .7, .2 \rangle$. If $R_1$ also has
the Strong keyword annotation, the second feature vector is $f(R_1) = \langle 0, .25, 1, .25 \rangle$.
This fuzziness allows for more meaningful retrieval results since it allows for a smoother
range of similarity scores. In addition to hue, saturation, and lightness, we make the
distance vector fuzzy since it also benefits from a gradient score. In general, any attribute
which can naturally be described using an intensity scale benefits from a fuzzy
representation.
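A keyword annotation can be turned into such a fuzzy vector with a few lines of Python. This is only a sketch: the helper name `fuzzy_keyword_vector` is hypothetical, and using the same spill value of 0.25 for every category is an assumption read off the single saturation example above.

```python
SATURATION_WORDS = ["Grayish", "Moderate", "Strong", "Vivid"]


def fuzzy_keyword_vector(words, keyword, spill=0.25):
    """Return a vector with 1 at the keyword and `spill` at its neighbours.

    `spill` = 0.25 matches the worked example in the text; treating it as a
    fixed constant for every category is an assumption of this sketch.
    """
    k = words.index(keyword)  # raises ValueError for an unknown word
    vec = [0.0] * len(words)
    vec[k] = 1.0
    for j in (k - 1, k + 1):
        if 0 <= j < len(words):
            vec[j] = spill
    return vec


strong = fuzzy_keyword_vector(SATURATION_WORDS, "Strong")
```

For the Strong annotation this reproduces the second feature vector of the example.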

Based on [15], a symmetric similarity function $\sigma$ is defined between two feature
vectors in the following way:

$$\sigma(f(R_a), f(R_b)) = \sum_{i=1}^{p} \min\big(f_i(R_a), f_i(R_b)\big) \qquad (1)$$

The dissimilarity $\delta$ can be written as $\delta(f(R_a), f(R_b)) = p - \sigma(f(R_a), f(R_b))$.
We maintain a feature vector for each semantic category of each region, though we 
could concatenate these vectors and arrive at the same result. Comparing two image 
regions then consists of computing dissimilarity scores for each semantic category and 
summing them to arrive at an aggregate dissimilarity score between those two regions. 
Where necessary the dissimilarity of two regions also takes into account the dual feature 
vector representation by equally weighting the histogram vector and fuzzy keyword 
vector in the final score. 
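The scoring just described can be sketched in Python as follows. The layout of the data is an assumption made for illustration: a query maps each semantic category to one keyword vector, and a database region maps each category to a list holding either one vector (binary categories) or two (histogram and fuzzy keyword vector), which are weighted equally. The name `region_dissimilarity` is hypothetical.

```python
def similarity(fa, fb):
    """Eq. (1): sum of the element-wise minima of two feature vectors."""
    if len(fa) != len(fb):
        raise ValueError("feature vectors must have the same length")
    return sum(min(a, b) for a, b in zip(fa, fb))


def dissimilarity(fa, fb):
    """delta = p - sigma, where p is the length of the vectors."""
    return len(fa) - similarity(fa, fb)


def region_dissimilarity(query, region):
    """Aggregate dissimilarity, summed over the query's semantic categories.

    Assumed layout: query[c] is one vector; region[c] is a list of one or
    two vectors for the same category, averaged with equal weights.
    """
    total = 0.0
    for category, q in query.items():
        vectors = region[category]
        total += sum(dissimilarity(q, v) for v in vectors) / len(vectors)
    return total


query = {"Sky": [1, 0], "Saturation": [0, 0.25, 1, 0.25]}
region_a = {"Sky": [[1, 0]], "Saturation": [[0, 0.1, 0.7, 0.2], [0, 0.25, 1, 0.25]]}
region_b = {"Sky": [[0, 1]], "Saturation": [[0.8, 0.2, 0, 0], [1, 0.25, 0, 0]]}
score_a = region_dissimilarity(query, region_a)
score_b = region_dissimilarity(query, region_b)
```

Sorting the database regions by this score in ascending order and keeping the first N would give the candidate list shown to the user. Note that under this definition two identical vectors do not have zero dissimilarity unless every component equals 1, so scores are meaningful for ranking rather than as absolute distances.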

Synthesis: In general there are many methods for doing image based synthesis. Here 
we focus specifically on using patch-based texture synthesis algorithms such as those of 
[1,2,3]. Patch-based approaches work by copying small regions of pixels from a source 
texture such that when these regions are stitched together they give the same impression 
as the original texture. The output texture need not be the same shape or size as the 
source. 

In particular we have implemented the method of texture synthesis and transfer 
detailed by Efros in [1]. Our pixel blocks are rectangular and tiled over the synthesis 
plane with overlapping regions through which seams are optimally cut using a dynamic 
programming algorithm (see Fig. 2). Each successive block of pixels is chosen by scanning
the source texture for areas that minimize a Euclidean error metric as measured
against imbricated adjacent areas. As many textures in our application are non-stationary
and change due to perspective effects, we also use vertical correspondence maps when
calculating the error metric. This ensures that parts of a source texture far away (i.e. at
the top of a region) are used as samples when synthesizing the parts of the destination
texture region that are also far away. Though this seems to work well in practice, the
amount of perspective can vary considerably between source and destination. Vertical
correspondence maps can fail to generate convincing results in such cases.
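The seam computation can be sketched as a standard dynamic program over the overlap strip between two horizontally adjacent blocks. The Python below is an illustration, not the prototype's Java code; it assumes the caller has already filled `err[r][c]` with the per-pixel error between the two blocks in the overlap (for example a squared colour difference), and the function name `min_error_seam` is hypothetical.

```python
def min_error_seam(err):
    """Return, for each row, the column of a minimum-cost vertical seam.

    err[r][c] is the (assumed precomputed) error between the two
    overlapping blocks at row r, column c of the overlap strip. The seam
    moves at most one column left or right from one row to the next.
    """
    rows, cols = len(err), len(err[0])
    cost = [list(err[0])]
    for r in range(1, rows):
        prev = cost[-1]
        cost.append([
            err[r][c] + min(prev[max(c - 1, 0):min(c + 2, cols)])
            for c in range(cols)
        ])
    # Trace back from the cheapest cell in the last row.
    c = min(range(cols), key=lambda j: cost[-1][j])
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(c - 1, 0), min(c + 2, cols)
        c = min(range(lo, hi), key=lambda j: cost[r - 1][j])
        seam.append(c)
    seam.reverse()
    return seam


overlap_error = [[5, 1, 5],
                 [5, 5, 1],
                 [5, 1, 5]]
seam = min_error_seam(overlap_error)
```

Pixels to the left of the seam are taken from one block and pixels to the right from the other. For vertically adjacent blocks the same routine can be applied to the transposed error strip.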
Though the dynamic programming algorithm for seam generation detailed in [1] does 
a decent job for textures with some high frequency content, it falters for more slowly 
changing textures such as a sky gradient. In these cases, we smooth these seams with a 
technique in which a Poisson equation is solved across the boundary of the seam [16]. 
This has the effect of creating smooth transitions between blocks without sacrificing the 
saliency of the underlying texture. 




[Figure 2 labels: Block Width; Synthesized Texture; Source Texture; Correspondence Maps]



Fig. 2. A texture is synthesized. The blow up shows two overlapping blocks taken from the source 
and the seam between them. Vertical correspondence maps guide sampling from the source. 




Fig. 3. Results generated using the CBIS application. Each block (a,b) consists of a source texture 
outlined in red (upper left); an input image (lower left); and the output (right). 
Fig. 4. Results generated using the CBIS application. Each block (a,b,c,d) consists of a source 
texture outlined in red (upper left); an input image (lower left); and the output (right). 



There are a few algorithm parameters that are worth mentioning briefly here. The 
block size must be chosen large enough to capture the largest feature or texton size of 
the source texture. Choosing a value too small can lead to synthesized results which lack 
the structure of the original. Additionally, we allow the algorithm to iterate across the 
whole synthesized texture a variable number of times such that the initial pass is done 
with larger blocks (thus “laying out” the texture) and subsequent passes done with a 
smaller block size such that finer details are preserved. 



5 Results 

Queries were performed against a small proof of concept database of about 50 images, 
representing just over 100 distinct image regions. Synthesis timings vary according to 
parameters and region size in both the source and destination images, but were in the 
range of about 2 to 30 minutes for a Java implementation on a 2.4 GHz processor. 

Some example images generated using our CBIS system are shown in Fig. 3 and 
Fig. 4. Results can also be viewed online at (http://cpl.cc.gatech.edu/projects/CBIS/); 
we encourage viewing of results digitally. In Fig. 3 (a) a city skyline is inserted; (b) a 
field of flowers is replaced with a field of rocks. Fig. 4 contains additional results: (a) a 
gentle sunset replaces a cloudy sky over Florence; (b) a distant island is inserted; (c) the 
night sky over Atlanta is replaced using the Starry Night painting by van Gogh leading 
to a unique blend of photograph and impressionist painting; (d) a rocky mountain is 
replaced with a snowy mountain. 

The parameters for each of the images in Fig. 3 and Fig. 4 were chosen carefully. In 
particular the block size was chosen to be slightly larger than any repeatable features in 
the source. All sky replacements utilized the Poisson smoothing functionality. Vertical 
correspondence maps were used for all source and destination region pairs with the 
exception of Fig. 4 (c). Also, two synthesis passes were made for each output image. 



6 Conclusions and Future Work 

We have introduced a method for the content-based generation of new images. A user 
defines a region in his image to replace and then queries a database of images annotated 
with region semantics and image characteristics to find a suitable source image. The 
chosen source image is then used in a texture synthesis step to produce the final output 
image. Results, such as those seen in Fig. 3 and Fig. 4, are visually convincing. 

Invoking semiotic theory allows us to view such a CBIS tool as a powerful way of 
changing meaning in photographic content through careful substitution and insertion of 
image elements. The extent of this editing power is dictated only by the size and breadth 
of the underlying database, and of the vocabulary used to annotate it. Thus, there are 
several obvious areas of future work such as expanding the database, annotating image 
regions with additional visual information such as perspective or lighting characteristics, 
and increasing the effective size of the query vocabulary by tying into a system such as 
WordNet. To expand the database we would specifically like to explore adding segmented 
objects to the database. A substitute object could then have the area around it filled in 
using a hole filling algorithm such as [17]. Improvements also need to be made in 
the synthesis phase, such as accounting for lighting effects (e.g. shadow in Fig. 1) or 
interactions between adjacent textures in the final image. 
Another important direction for future work is in applying a content-based synthesis 
framework to other types of media such as video or audio. To be successfully applied in 
each of these domains several things must first be defined: (1) appropriate segmentation 
methods and units, (2) annotation vocabulary and low-level features, and (3) synthesis 
algorithms that combine the segmented units so that the output is believable. In short, 
workable segmentation and integration algorithms must be found for these other types 
of media. 
the XML storage system. We appreciate the helpful comments of Stephanie Brubaker in 
preparing this paper. We also acknowledge the copyright holders of image segments used, 

Abstract. We propose using a truncated object-object similarity matrix 
as an access structure for interactive video retrieval. The proposed ap- 
proach offers a scalable solution to retrieval and allows combination of 
different feature spaces or sources of information. Experiments were per- 
formed on TREC Video collections of 2002 and 2003. 



1 Introduction 

With the rapid development of digital media in the past decade, content-based
information retrieval (CBIR) has become an active research area. Originally meant 
for text documents, information retrieval quickly became dearly needed for other 
media such as still images and video. Though CBIR usually suggests the retrieval 
of non-textual information, the term does not exclude text documents.¹ The goal 
of a retrieval system is to satisfy the information need of the user. The informa- 
tion need is communicated to the system, e.g. by providing an example query. 

A number of approaches to CBIR exist. The pioneering image retrieval systems
drew on the extensive experience of the text retrieval domain, successfully 
adopting the vector space model [7,14,10]. Probabilistic approaches from text 
retrieval (e.g. [9,12]) gained less popularity among non-text CBIR researchers, 
with some notable exceptions [3,19,16]. One of the reasons lies in the difficulty of 
translating the lower-level features into probability values. Other recent research 
is inspired by machine learning methods. Self-organising maps [11] and support 
vector machines [4,15] are employed to solve the problems of CBIR. Many exist- 
ing retrieval systems rely on active participation of the searcher in the retrieval 
process, which is known as relevance feedback [13]. 

Regardless of the approach used, a retrieval system should be able to ‘understand’
the user’s information need and provide him/her with satisfactory 
answers. The problem is that high-level content of a document, in the way a 
human being understands it, is hard to translate into a machine-language con- 
cept with current techniques for automatic lower-level feature extraction. Rich 
feature spaces might be created in an attempt to achieve a correspondence between
lower-level features and human perception. This immediately creates a 

1 In this paper the term ‘document’ is used in a broad sense, implying any source of 
information such as text, images, videos, etc. 



P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 308-316, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004 
disadvantage: a high-dimensional space that is not well suited for fast access 
via indexing. It raises the scalability problem: methods that perform well on 
small collections cannot be used on a collection of usable size, due to the
‘dimensionality curse’ [6]. In this paper we propose a framework for content-based 
indexing and retrieval that 

– is able to use any available technique for feature extraction, and allows easy 
combination of different sources of information; 

– focuses on relevance feedback as an important component of the information 
retrieval process; 

– allows efficient interaction with the user, i.e. it offers a solution to the
scalability problem. 

We present a description of the proposed framework in Sec. 2. Interaction be- 
tween the system and the user is studied in Sec. 3. Experiments performed on 
the collection of the TREC Video retrieval workshop (TRECVID) [1] are presented 
in Sec. 4. Conclusions and future work directions can be found in the last section. 



2 Probabilistic Indexing and Retrieval 

Consider a collection X of information objects i, among which there is one that 
the user is looking for: the search target, denoted T.² During the search process, 
the system presents the user with intermediary retrieval results. The user can 
indicate which examples are relevant to his/her information need, those are pos- 
itive examples. If an object is not relevant to the query, the user may indicate 
so, thus providing the system with negative examples. Given the feedback infor- 
mation, the retrieval system produces a new set of candidate documents to be 
assessed by the user. There may be several loops of relevance feedback during 
one search session. 

We want to make use of the notions ‘relevant’ and ‘non-relevant’ without 
having to refer to lower-level (image) features. We do so by relating objects in the 
collection to each other. A binary variable $\delta_i$, which takes the values 1 and 0, denotes 
the events of positive and negative feedback respectively. For two documents the 
following reflects their ‘measure of closeness’: $P(\delta_i = 1 \mid T)$, the probability that 
object i is marked by the user as relevant given that T is the target of the search. 
When unambiguous, we use the shorthand notation $P(\delta_i \mid T)$. 

2.1 Interactive Retrieval in a Probabilistic Framework 

For interactive retrieval we use a probabilistic approach. The idea is to predict 
the set of documents relevant to the user’s information need, based on his/her 
request, accompanied by feedback, and the data representation (i.e. our measure 
of closeness $P(\delta_i \mid T)$). Using Bayes’ rule the problem can be stated as estimating
the probability of relevance P(T) given the user’s feedback $\delta_1, \ldots, \delta_n$ and the 
collection indexing [12,3,16]. 

2 The search target may be a single document, but it can as well be a number of 
documents covering a certain subject satisfying the user’s information need. 
We write it down in the following iterative form, using the assumption that 
the $\delta_1, \ldots, \delta_n$ are conditionally independent given the target T: 

$$P_{new}(T) = P(T \mid \delta_1, \ldots, \delta_n) = \frac{P_{old}(T) \prod_{s=1}^{n} P(\delta_s \mid T)}{P(\delta_1, \ldots, \delta_n)} \qquad (1)$$
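One feedback iteration of (1) can be sketched in Python as follows. This is an illustration under stated assumptions: the target is taken to be exactly one object of the collection, so the denominator is obtained by normalising over all candidates; conditional probabilities are stored sparsely in a dictionary; and `p_default`, a hypothetical parameter, supplies a constant for pairs that have no stored value.

```python
def update_target_probabilities(prior, feedback, assoc, p_default):
    """One relevance-feedback iteration of Eq. (1).

    prior     : dict object -> P_old(T = object), summing to 1
    feedback  : dict judged object s -> 1 (relevant) or 0 (non-relevant)
    assoc     : dict (s, t) -> P(delta_s = 1 | T = t), stored sparsely
    p_default : constant used for pairs missing from assoc (assumption)
    """
    posterior = {}
    for t, p_t in prior.items():
        likelihood = 1.0
        for s, judged in feedback.items():
            p_rel = assoc.get((s, t), p_default)
            likelihood *= p_rel if judged == 1 else 1.0 - p_rel
        posterior[t] = p_t * likelihood
    norm = sum(posterior.values())
    if norm == 0.0:
        return dict(prior)  # feedback ruled out everything; keep the prior
    return {t: v / norm for t, v in posterior.items()}


prior = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
assoc = {("a", "a"): 1.0, ("a", "b"): 0.9}
posterior = update_target_probabilities(prior, {"a": 1}, assoc, p_default=0.1)
```

Feeding the returned dictionary back in as `prior` for the next round of judgements gives the iterative form of the update.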



We distinguish the following factors that influence an interactive search session: 



1. The input provided by the user who is assumed to be reasonable in his/her 
query formulation and feedback. 

2. The current document representation. Within one search session, the index- 
ing of the collection is a static component of the model. 

3. The prior information about the relevance of documents in the collection. 

Below we describe our approach to indexing of a multimedia collection. 



2.2 Indexing: The Structure of the Association Matrix 

Documents in the collection and their conditional probabilities $P(\delta_i \mid T)$ can be 
visualised as a directed graph with objects $i \in X$ as nodes and arcs with weights 
$P(\delta_i \mid T)$ connecting them. In this way each object is described by its associations 
with a number of other objects linked to it. We call such a representation of the 
collection an association matrix, denoted M. 

Ideally we want the associations to refer to high-level semantics (e.g. coming 
from users’ judgements) which might not be achieved using lower-level features. 
Starting at the point when we do not have knowledge about the human per- 
ception of similarity, the associations need to be based on something different. 
We propose to bootstrap the process by basing the associations on a similarity 
measure on lower-level features, such as colour, texture, or shapes present in an 
image (e.g. as used in [7,16]). Typically such similarity measures take values in 
$\mathbb{R}$ or $\mathbb{R}^+$ and thus cannot be directly used as an initial estimate for $P(\delta_i \mid T)$. 

In our model we take pairwise similarities based on, e.g., pictorial features, 
and we are looking for an appropriate transformation to obtain probabilities. 
Any increasing function with the domain $\mathbb{R}$ and the range [0,1] could suit. 
When deciding the probabilities in our model, we would like to achieve equal 
emphasis of alike similarities and obtain probabilities uniformly distributed 
in [0,1]. The transformation used spreads the observations evenly over this interval
according to their probability of occurrence and not the magnitude of the 
similarity measure. As a result it reduces the influence of outliers, preserves 
the scale of the similarities between documents and ‘improves the discrimination 
capabilities of the similarity measures’ [2]. Since a priori we cannot prefer some 
documents of the collection to others in the sense of the distribution of $P(\delta_i \mid T)$, 
the underlying similarities are assumed to be random values conforming to the same 
probability distribution: the normal distribution. 

We transform the computed similarities by subtracting the sample mean 
and dividing by the sample standard deviation and then applying the standard 
normal cumulative distribution function, to obtain estimates of the probabilities, 
which are denoted by $P(\delta_i \mid T)$. The value of $P(\delta_i = 0 \mid T) = 1 - P(\delta_i = 1 \mid T)$ 
obtained in this way can be interpreted as a P-value, the probability that a 
variable assumes a value greater than or equal to the observed one strictly by 
chance [18]. Thus by specifying some $\alpha$ such that $P < \alpha$, only significant pairwise
similarities and their corresponding $P(\delta_i \mid T)$ are taken into account, and the 
rest are replaced by an appropriate constant, further denoted by p. When updating 
P(T) for each object in (1), $P(\delta_i \mid T)$ is substituted with p if it is below $1 - \alpha$. 
Here $1 - \alpha$ serves as a cut-off threshold for the right tail of the distribution. For 
the rest of the paper the corresponding threshold for the left tail is set to zero. 
A pair of documents i, T whose $P(\delta_i \mid T)$ is significant are called neighbours. 
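The transformation and the truncation can be sketched in Python using only the standard library. The function names are hypothetical; the standard normal CDF is computed from `math.erf`, and the sketch handles the similarities of one target document against the rest of the collection at a time.

```python
import math


def similarities_to_probabilities(sims):
    """Standardise the similarities, then apply the standard normal CDF."""
    n = len(sims)
    if n < 2:
        raise ValueError("need at least two similarities")
    mean = sum(sims) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in sims) / (n - 1))  # sample std
    if std == 0.0:
        raise ValueError("all similarities are equal")
    return [0.5 * (1.0 + math.erf((s - mean) / (std * math.sqrt(2.0))))
            for s in sims]


def keep_neighbours(probs, alpha):
    """Keep entries at or above the cut-off 1 - alpha; the rest are dropped
    from storage and later replaced by the constant p during the update."""
    return {i: pr for i, pr in enumerate(probs) if pr >= 1.0 - alpha}


sims = [0.0, 1.0, 2.0, 3.0, 10.0]
probs = similarities_to_probabilities(sims)
neighbours = keep_neighbours(probs, alpha=0.1)
```

In this toy example only the outlying similarity survives the cut-off, which is the intended effect: each row of the association matrix keeps a handful of significant entries.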

Keeping only neighbours for each element makes the association matrix 
sparse, which allows faster access to the data. In our experiments, with an
appropriate/optimal choice of the cut-off threshold (depending in particular upon the 
size of the collection), the association matrix can grow as slowly as linearly without 
loss of search quality. Pre-computed probabilities allow easy combination
of different modalities from otherwise hard to combine feature spaces, such as 
visual information from a shot and speech transcripts from spoken words [8]. 



3 Modelling Interaction for Retrieval 

The user feedback. During the search session, the current probability of an 
element being the user’s search target, P(T), is updated according to (1). Every 
document can be either relevant to the user’s information need or not, i.e. the 
events are disjoint: $P(\delta_i = 1 \mid T) = 1 - P(\delta_i = 0 \mid T)$. The objects that are not 
marked by the user as relevant take part in the probability update as if they 
were explicitly rejected by the user. 

To ensure that, when positive examples are scarce, the excessive (implied) negative
feedback does not bury the precious positive examples, p is set in our 
experiments to a value in the interval $(0, \alpha)$, with the effect that in the ranked list 
of results the non-neighbours of negative examples do not precede the neighbours 
of known (if any) positive examples from the last iterations. 



New display for the next iteration. Upon updating P(T), a new set of 
objects should be presented for relevance judgement, to receive new evidence 
from the user. The display update is an important part of the search process, 
since efficiency and quality of retrieval depend on it. Each iteration should bring 
the user closer to his/her target object. ‘Closer to the target’ may have various 
interpretations, such as: the posterior probability P(T) of the desired information 
object(s) tends to 1; or the target object approaches the top of the ranked list, 
etc. The goal of the search is not only to satisfy the users’ need, but to do it 
in few iterations and/or in a limited amount of time. In this paper we report 
experiments performed with the following display update strategies. 

Best-target strategy. Following the probability ranking principle [12], P(T) is treated
as a score that the element receives during the retrieval session. The next display 
set consists of (new) documents that have the largest values of P(T). 
The Best-target strategy is plausible for a user unfamiliar with content-based
retrieval (thus, the majority of potential users). The screen always contains 
objects that are the neighbours of good examples marked by the user. The user 
is able to observe the immediate result of his/her action. It is not clear, however, 
whether this approach brings the search to the target quickly enough. Cox 
et al. [3] report that the Best-target search occasionally gets stuck in an isolated 
‘island’ of non-relevant documents that are similar to each other only. 

Non-deterministic strategies. The Randomised display set consists of objects 
picked from the collection at random. Uniform sampling may give a relatively good 
representation of the collection, which supposedly allows the relevant 
documents to be found quickly. Sampling could be especially useful at the beginning of a 
search session, when the system has little knowledge about the user’s information 
need. When, after a number of iterations, the mass of P(T) is concentrated on a 
small (relevant) subset of the collection, sampling the whole data becomes useless 
and may have a negative effect on the search quality. To minimise this effect, the 
Random-of-Best strategy makes the selection among those objects whose 
probability of being the target has increased since the last iteration, which are 
effectively neighbours of relevant examples and/or non-neighbours of the non-relevant
ones. Ideally, the set of elements whose P(T) increases should 
shrink to the group of documents that satisfy the user’s information need. 
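The three display update strategies can be sketched in Python as follows. The function names and the dictionary representation of P(T) are assumptions for illustration, as is the choice to exclude already displayed objects from the Randomised strategy as well.

```python
import random


def best_target(p_target, seen, k):
    """The k not yet displayed objects with the largest P(T)."""
    unseen = [o for o in p_target if o not in seen]
    return sorted(unseen, key=lambda o: p_target[o], reverse=True)[:k]


def randomised(p_target, seen, k, rng):
    """Uniform sample of k not yet displayed objects."""
    unseen = [o for o in p_target if o not in seen]
    return rng.sample(unseen, min(k, len(unseen)))


def random_of_best(p_target, p_previous, seen, k, rng):
    """Uniform sample among unseen objects whose P(T) has increased
    since the previous iteration."""
    improved = [o for o in p_target
                if o not in seen and p_target[o] > p_previous[o]]
    return rng.sample(improved, min(k, len(improved)))


p_previous = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
p_now = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
rng = random.Random(0)  # fixed seed so that a session can be replayed
next_best = best_target(p_now, seen={"a"}, k=2)
next_mixed = random_of_best(p_now, p_previous, seen=set(), k=5, rng=rng)
```

Passing the random generator in explicitly keeps the non-deterministic strategies reproducible, which is convenient when comparing strategies over the same sequence of feedback.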



4 Experiments and Evaluation 

4.1 Interactive Experiment Setup 

We use video data provided in the framework of TRECVID. The videos are seg- 
mented into shots, and from each shot a representative key frame is extracted. 
Conditional probabilities for the association matrix are estimated using a generative
probabilistic retrieval model (see [19] for details): 

1. $M_v$, using Kullback-Leibler divergence as the similarity measure for Gaussian 
Mixture Models built on pictorial data; 

2. $M_t$, using language model-based similarity on text from speech transcripts; 

3. $M_{tv}$, a run-time combination of the two modalities, which adds up the 
relevance scores achieved in both matrices. 

We conducted an empirical study of the performance difference caused by the 
prior distribution of the probability of relevance. In order to provide a better 
than uniform prior probability of relevance, for a number of experiments the 
text from search topic descriptions serves as a query to match against the speech 
transcripts, using a language model [9]. In another version of the system the prior 
distribution is determined by the number of neighbours in the association matrix 
for each document, so that a document with many neighbours has a higher chance 
of being displayed. This is useful when no prior information about the information 
need is available, for instance in unannotated data, or when the query terms 
typed by the user do not occur in the collection. 
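A neighbour-count prior can be sketched in Python as below. The exact functional form is not specified above, so this sketch assumes a prior proportional to one plus the number of stored neighbours; the add-one term, which keeps isolated documents reachable, and the name `neighbour_count_prior` are both assumptions.

```python
def neighbour_count_prior(assoc, objects):
    """Prior P(T) proportional to 1 + number of stored neighbours of T.

    assoc   : dict (s, t) -> probability, holding only significant pairs
    objects : iterable of all objects in the collection
    """
    counts = {o: 0 for o in objects}
    for (s, t) in assoc:
        if s != t and t in counts:
            counts[t] += 1
    total = sum(1 + c for c in counts.values())
    return {o: (1 + c) / total for o, c in counts.items()}


assoc = {("a", "b"): 0.9, ("c", "b"): 0.95, ("b", "a"): 0.9}
prior = neighbour_count_prior(assoc, ["a", "b", "c"])
```

Here document b, with two neighbours, receives the largest prior mass and would therefore be the most likely to appear on the first display.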
A retrieval session starts with browsing a display set of 12 key frames generated
by the prior distribution of P(T), which might be based on an initial text 
query. The user does not have to provide an example query image. 

The documents in the ranked list are ordered by decreasing probability 
of relevance. A standard TREC evaluation metric, mean average precision (MAP), is 
used as a measure of the user’s satisfaction (see [5, Appendix]). Where questionable, 
a signed rank test is used to determine if a difference in performance between two 
methods is significant. If not stated otherwise, the significance level is p < 0.05.
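For reference, mean average precision can be computed as in the following Python sketch. It uses the usual non-interpolated definition, in which the precision at the rank of each relevant document retrieved is summed and divided by the total number of relevant documents for the topic; the function names are hypothetical and ranked lists are assumed to contain no duplicates.

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision of one ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked list, relevant set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [(["a", "x", "b"], {"a", "b"}),
        (["x", "a"], {"a", "z"})]
map_score = mean_average_precision(runs)
```

Relevant documents that are never retrieved (such as z in the second topic) lower the score because they still count in the denominator.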

4.2 Automated Experiments 

In the series of experiments referred to as automated, the user input has been 
replaced with relevance judgements available from TREC assessors, who played 
the role of a ‘generic user’. The experiments have been performed on a subset of 
the collection selected so that half of the key frames were relevant to at least 
one of the 25 topics. The goal of this setup was to test the retrieval performance 
in our probabilistic framework, and to find optimum settings to be used in the 
experiments with real users. 

Values in the association matrix. Values of MAP after each iteration using 
two types of the association matrix and their combination, with the best found 
values of p, and two matrices with all pairs of probabilities, are plotted in Fig. 1. 
Combining visual and text modalities results in better performance than using 
either separately. In the runs where text from the topic descriptions is used as 
the query (Fig. 1b), the difference in average precision is smaller, which is an 
expected result: the shots that are relevant because of the initial query text are 
put on top of the ranked list, and further search depends on this prior distribution 
by the nature of the Bayesian approach. 





Fig. 1. MAP for different matrices vs. all pairs, without (a) and with (b) text-based 
prior distribution. In (b) the difference between curves (1) and (2) is not significant. 



In the automated experiments the threshold $1 - \alpha$ is such that on average 
3% of possible values need to be stored. Keeping only significant $P(\delta_i \mid T)$ in 
fact improves the search quality compared to the complete set of conditional 
probabilities, with both visual and text-based matrices. This suggests that the 
probabilities replaced with the constant p are indeed far from true similarities. 
The display update strategies. The Best-target display update with an ad hoc
tuned value of p offers great improvement over iterations, both with 
and without the text priors. By making sure that the user does not see the same 
object twice, the danger of getting stuck in a local maximum is eliminated. 

The two non-deterministic strategies do not perform as well, especially when 
prior text information is used. The non-deterministic methods perform on average
10 to 15 percent better if negative examples are ignored during the update 
of P(T). As expected, the combination of the Randomised and Best-target strategies
(Random-of-Best) did better than the ‘pure’ Randomised strategy. Still, uniform 
sampling of the more relevant subset of the collection, as done in Random-of-Best, cannot 
beat the deterministic Best-target method. Sampling according to the estimated 
distribution of P(T) might be a better option. In Fig. 2 the best-performing 
combination is plotted for each display update strategy. 





Fig. 2. MAP for different display update strategies without (a) and with (b) the prior 
text information. 



The prior distribution based on text from the query description and words
from speech transcripts provides better performance overall. Nevertheless, having
little or no a priori information does not necessarily mean poor performance:
curve 1 in Fig. 1a, for the method with no prior information available, reaches
numbers comparable with the corresponding curve in Fig. 1b.

4.3 Live Experiments 

In the live experiments, the search tasks were performed by real users (see
footnote 3). The data set contained about 32 000 key frames taken from 60 hours
of news videos. We found high agreement between the real users' feedback and the
TREC relevance judgements (75% on average across runs), so our automated
experiments can be viewed as a good approximation to real life (see [17] for an
analysis of agreement between TREC assessors).

The set-up is similar to that of the automated experiments, using the Best-target
display update scheme and the text-based prior distribution P(T). The user was
allowed to see key frames (images), and not the corresponding videos. Only
positive feedback from the user was taken into account. The resulting MAP at the
end of the live experiment, evaluated by TREC, is 0.245. For this run, 78% of the
shots selected by the user were relevant according to TREC. At the same time,
48% of the relevant shots that were displayed were missed by our users. In
the experiment that showed the user random screens (MAP 0.026), the number of
missed shots was much lower (31%), as was the agreement with TREC (55%). The
relevant documents are missed partly because the user saw still
frames, and not the videos themselves, but the difference in numbers between
the runs suggests that the relative nature of the users’ judgements (the user
selects the best of what is available, and two users do not always agree) plays
a role, too.

3 Two groups of 3 users tested 3 systems. All users are students of the
University of Twente aged between 19 and 26. Each search task took at most 15
minutes.



4.4 Scalability of the Approach 

The term ‘scalability’ denotes not only the possibility of running a retrieval
system on a larger collection. The ability of a retrieval system to produce
answers to the user’s queries in a reasonable number of iterations is at least
as important.

We ran a number of automated experiments on a system consisting of 32 000
key frames from the TRECVID 03 data. After 48 iterations on the large collection,
the MAP of the best automated run is 0.44, compared to 0.58 achieved on the small
collection. Note that half of the small collection were key frames relevant to one
of the topics, whereas in the large collection only 6.5% of the key frames were
relevant to one of the 25 topics. The execution time on the large collection, which
is eleven times bigger than the small one, increased by a factor of 5 to 6.

5 Conclusions and Future Work 

We found that feature normalisation and the ‘refinement’ that we propose, namely
replacing non-significant similarities with a constant, result in better search
quality in both investigated feature spaces, text-based and visual. Using
the association matrix as an index structure enables efficient combination of
different modalities, such as visual information from key frames and transcripts
of the speech occurring in video shots. Combining text and video (in the form
of key frames) has a positive effect on retrieval.

Organising the objects in a multimedia collection using the association matrix
allows a scalable implementation, which is hard to achieve otherwise: computing
similarities ‘on the fly’ is expensive in terms of access time and/or computation
effort, whereas keeping all pre-computed similarities is impractical from the
storage point of view. Keeping only the significant similarities allows building an
interactive content-based retrieval system that provides fast response time and
good search quality on rather large image or video collections.

Text, in the form of speech transcripts of videos or annotations, is an impor- 
tant source of information about the multimedia content. When available, the 
text data should be used in combination with pictorial features, to improve the 
search results. 

In the future we want the probabilities stored in the association matrix to be
updated using the relevance judgements obtained from the user’s feedback. We are
also going to investigate how to change the search strategy dynamically,
depending on user-system performance.

Abstract. Much previous work on image retrieval has used global features
such as colour and texture to describe the content of the image.
However, these global features are insufficient to accurately describe the
image content when different parts of the image have different
characteristics. This paper discusses how this problem can be circumvented by
using salient interest points, and presents an extension to previous work
in which the concept of scale is incorporated into the selection of salient
regions, in order to select the areas of the image that are most interesting
and to generate local descriptors that describe the image characteristics
in each region. The paper describes and contrasts two such
salient region detectors and compares them through their repeatability
rate under a range of common image transforms. Finally, the paper goes
on to investigate the performance of one of the salient region detectors
in an image retrieval situation.



1 Introduction 

Much previous work in the field of content-based retrieval has been based around
the concept of using global descriptors to describe the content of the image.
More recently, researchers have begun to realise that global descriptors are not
necessarily good when it comes to describing the actual objects within the
images and their associated semantics. Two approaches have grown from this
realisation: firstly, approaches have been developed whereby the image is
segmented into multiple regions, and separate descriptors are built for each
region; and secondly, the use of salient points has been suggested.

The first approach has been demonstrated to work [1], although it has a
large problem: how to perform the segmentation. Over the years many
techniques for performing image segmentation have been suggested, although
none really solves the problem of linking the segmented region to the actual object
that is being described. Indeed, this shows that the non-naive segmentation
problem is not just a bottom-up image processing problem, but also a top-down
problem that requires knowledge of the true object before it can be successfully
segmented.



P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 317-325, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

The second approach avoids the problem of segmentation altogether by choosing
to describe the image and its contents in an altogether different way. The
use of saliency in computer vision has become quite widespread in recent years.
Saliency is often used to provide the basis for a visual attention mechanism that
reduces the need for computational resources [2]. Historically, saliency was
described by the term ‘interest point detectors’, but use of the term ‘saliency’ has
come about from the large amount of psychology-based work on selective visual
attention. By using salient points within an image, it is possible to derive a
compact image description based around the local attributes of the salient points. A
number of different methods for finding salient points have been suggested, from
the simple Harris and Stephens corner detector [3], to wavelet-based approaches
[4,5,6], to methods based around image entropy [7,8]. Many previous approaches
to using salient points have generated feature vectors from pixel data in
fixed-sized regions around the salient point, usually a 3x3 or 9x9 pixel neighbourhood
centred on the point [5], although some of the modern state-of-the-art
detectors find affine-invariant regions and generate descriptors from within the region
[9,10,11]. This paper presents an extension to previous work in
which the concept of scale is used in the selection of salient points (or rather
salient regions), and the pixel content of the entire region is used to build the
feature vector of the local descriptor.

2 Salient Regions 

2.1 Scale Saliency 

The Scale-Saliency algorithm developed by Kadir and Brady [7,8] was based
on earlier work by Gilles [12]. Gilles investigated salient local image patches
or ‘icons’ to match and register two images (specifically aerial reconnaissance
images). Gilles suggested that by extracting locally salient features from the
pair of images and matching these, it would be possible to estimate the global
transform between the two images. Gilles defined saliency in terms of local
signal complexity or unpredictability. More specifically, he suggested the use of
the Shannon entropy of local attributes to estimate the saliency. Basically, image
segments with flatter intensity histogram distributions (see footnote 1) tend to
have higher signal complexity and thus higher entropy. Gilles' method only worked
at a single scale, and picked single salient points rather than salient regions.
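To make the link between histogram flatness and entropy concrete, the following minimal Python sketch computes the Shannon entropy of the grey-level histogram of a patch. It is an illustration of the general idea rather than Gilles' or Kadir and Brady's code; the function name and the two toy patches are assumptions made for the example.

```python
import math
from collections import Counter


def patch_entropy(pixels):
    """Shannon entropy (in bits) of the grey-level histogram of a patch."""
    counts = Counter(pixels)
    total = len(pixels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# A uniform grey patch has a single occupied histogram bin.
flat_patch = [128] * 16
# Sixteen distinct grey levels give a perfectly flat histogram.
busy_patch = list(range(0, 256, 16))

flat_entropy = patch_entropy(flat_patch)   # no uncertainty at all
busy_entropy = patch_entropy(busy_patch)   # log2(16) bits
```

The uniform patch carries no uncertainty, while the patch with a flat histogram reaches the maximum entropy possible for sixteen samples, which is why flat histograms are treated as a sign of high local complexity.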

Kadir and Brady modified Gilles' original algorithm to make it perform well
on images other than aerial reconnaissance imagery. Essentially they
changed the algorithm so that it detected salient regions at multiple scales by
looking for self-similarity across scales. The modified algorithm located circular
patches of the original image that were considered salient. The size of the patch
was determined automatically by the multi-scale additions to Gilles' algorithm.

1 Kadir and Brady [8] note that the method is not limited to the intensity histogram
and that it is equally possible to use a histogram from a different descriptor, such
as colour or edge strength.

Fig. 1. (a) Salient regions found by the Scale-Saliency algorithm; (b) Salient regions
found from peaks in a difference-of-Gaussian pyramid



In addition, Kadir and Brady developed a simple clustering algorithm to group
together features within the $\mathbb{R}^3$ space that have similar x and y location and
scale. Figure 1(a) illustrates the results of applying the algorithm to an image.



2.2 Peaks in the Difference-of-Gaussian Pyramid

We take the idea of using peaks in a difference-of-Gaussian pyramid from the
work of Lowe [13,14] on object recognition using keypoints. Lowe has shown that
by searching a difference-of-Gaussian pyramid for local peaks, both spatially and
across scale, it is possible to select points robust to a range of projective
transformations. The difference-of-Gaussian closely approximates the scale-normalised
Laplacian-of-Gaussian [15,13], $\sigma^2 \nabla^2 G$. Mikolajczyk [16] showed that the minima
and maxima of $\sigma^2 \nabla^2 G$ produced the most stable interest points when compared
to a range of other operators. Figure 1(b) illustrates the results of finding peaks
in a difference-of-Gaussian pyramid.



2.3 Comparison of Salient Region Methods 

The two methods for selecting salient regions described above are quite
similar. For example, when the response of a difference-of-Gaussian filter is large,
we would also expect the entropy taken over the same area as the filter to be
large. Note that the converse is not always true, though: high entropy does not
necessarily mean that there would be a large difference-of-Gaussian response.
This is illustrated in Figure 2.

One problem with entropy is that it is very sensitive to noise. This is
especially so at small scales, where there are relatively few pixels to sample
from in order to estimate the probability density function, and hence the entropy.
The difference-of-Gaussian is much less sensitive to noise due to the smoothing
effect of the Gaussians. This is illustrated in Figure 3.

Fig. 2. Entropy and difference-of-Gaussian (ratio of $\sigma$'s = 1 : 1.6, smaller
$\sigma$ is shown on the top x-axis) response versus scale to a one-dimensional signal
as illustrated in the top diagram. The centre of the DoG and Entropy mask are
kept at a constant position relative to the signal (shown by the dashed line).
The graph illustrates how the response functions behave in a similar manner
across scale-space

Fig. 3. Response of Entropy and difference-of-Gaussian functions to a constant
signal with increasing amounts of zero-mean additive Gaussian noise. The DoG
response stays stationary, whilst the Entropy response increases with noise
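The different noise behaviour of the two measures can be reproduced with a small one-dimensional experiment. The sketch below is not the code behind Figure 3; it assumes a constant signal of grey level 100, a kernel truncated at three standard deviations, integer quantisation of grey levels for the histogram, and hypothetical function names throughout.

```python
import math
import random
from collections import Counter


def gaussian_kernel(sigma):
    """Discrete, normalised 1-D Gaussian truncated at 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_at(signal, centre, sigma):
    """Gaussian-weighted average of the signal around one position."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    return sum(w * signal[centre + offset - radius]
               for offset, w in enumerate(kernel))


def dog_response(signal, centre, sigma, ratio=1.6):
    """Difference of two smoothed values with sigmas in the ratio 1 : 1.6."""
    return smooth_at(signal, centre, sigma) - smooth_at(signal, centre, sigma * ratio)


def entropy_response(signal, centre, radius):
    """Shannon entropy (bits) of the quantised grey levels in a window."""
    window = [int(round(v)) for v in signal[centre - radius:centre + radius + 1]]
    total = len(window)
    return sum(-(c / total) * math.log2(c / total)
               for c in Counter(window).values())


def noisy_constant(level, noise_sigma, length, rng):
    """Constant signal plus zero-mean additive Gaussian noise."""
    return [level + rng.gauss(0.0, noise_sigma) for _ in range(length)]


rng = random.Random(0)
clean = noisy_constant(100.0, 0.0, 201, rng)
noisy = noisy_constant(100.0, 8.0, 201, rng)
```

Comparing `entropy_response` and `dog_response` on `clean` and `noisy` at the same centre shows the effect: the entropy jumps from zero to several bits, while the difference-of-Gaussian response stays close to zero because both Gaussians average the noise away.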

The remainder of this section is devoted to objectively comparing the stability 
of the two salient region detectors. 



Repeatability. We take the measure of repeatability of interest points from
Schmid et al. [17]. The concept of repeatability is described below together with
some results.

Repeatability Criterion. Repeatability is a measure of how independent an
interest point detector is of the imaging conditions, i.e. the camera parameters
(position relative to the scene, zoom, etc.). 3D points detected in one image
should also be detected at approximately the same locations in subsequent images.
Given a point $X$ in 3D space and two projection matrices, $P_1$ and $P_2$, the
projections of $X$ in two images $I_1$ and $I_2$ are given by $p_1 = P_1 X$ and
$p_2 = P_2 X$ respectively. The point $p_1$, detected in image $I_1$, is repeated
if the corresponding point $p_2$ is detected in image $I_2$. In order to estimate
the repeatability, a unique relation between the points $p_1$ and $p_2$ has to be
found. In the case of a planar scene, points in one image are related to points
in a second image by a planar homography: $p_2 = H p_1$.

The percentage of points that are repeated with respect to the total number of
detected points is called the repeatability rate. In general, a point is not
repeated at exactly the position given by $H p_1$, but in a small neighbourhood
of that point. Denoting the size of the neighbourhood by $\epsilon$, we can define
the $\epsilon$-repeatability. Interest points that cannot be observed in both
images will corrupt the repeatability measure, thus only points in the common
part of the scene are used to calculate the repeatability. The common part of
the scene is defined by the homography, thus the points $\tilde{p}_1$ and
$\tilde{p}_2$ which lie in the common parts of images $I_1$ and $I_2$ are defined
by $\{\tilde{p}_1\} = \{p_1 \mid H p_1 \in I_2\}$ and
$\{\tilde{p}_2\} = \{p_2 \mid H^{-1} p_2 \in I_1\}$. The set of point pairs
$(\tilde{p}_2, \tilde{p}_1)$ that correspond within an $\epsilon$-neighbourhood is
$D(\epsilon) = \{(\tilde{p}_2, \tilde{p}_1) \mid \mathrm{dist}(\tilde{p}_2, H \tilde{p}_1) < \epsilon\}$.

As the number of detected points in the two images may be different, the
repeatability rate is defined as:

$$r(\epsilon) = \frac{|D(\epsilon)|}{\min(|\{\tilde{p}_1\}|, |\{\tilde{p}_2\}|)} \qquad (1)$$
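Equation (1) translates directly into code. The following Python sketch is one possible reading of the definition, under the assumptions that points are (x, y) tuples, that the homography and its inverse are supplied as 3x3 nested lists, and that an image is described only by its (width, height); none of these names come from the paper.

```python
import math


def apply_homography(h, point):
    """Map a 2-D point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xh / w, yh / w)


def inside(point, size):
    """True if the point falls within an image of the given (width, height)."""
    width, height = size
    return 0 <= point[0] < width and 0 <= point[1] < height


def repeatability(points1, points2, h, h_inv, size1, size2, eps=1.5):
    """Epsilon-repeatability rate r(eps) following equation (1)."""
    # Keep only points that lie in the part of the scene common to both images.
    common1 = [p for p in points1 if inside(apply_homography(h, p), size2)]
    common2 = [p for p in points2 if inside(apply_homography(h_inv, p), size1)]
    if not common1 or not common2:
        return 0.0
    # D(eps): pairs (p2, p1) with dist(p2, H p1) < eps.
    pairs = [(p2, p1) for p1 in common1 for p2 in common2
             if math.dist(p2, apply_homography(h, p1)) < eps]
    return len(pairs) / min(len(common1), len(common2))


# Toy example: the second image is the first shifted 10 pixels to the right.
shift = [[1, 0, 10], [0, 1, 0], [0, 0, 1]]
unshift = [[1, 0, -10], [0, 1, 0], [0, 0, 1]]
detected1 = [(5, 5), (20, 30), (95, 50), (40, 40)]
detected2 = [(15.5, 5), (30, 31), (70, 70), (2, 2)]
rate = repeatability(detected1, detected2, shift, unshift, (100, 100), (100, 100))
```

In the toy example one point from each image falls outside the common region, two of the remaining three points are re-detected within 1.5 pixels, and the rate is therefore 2/3. Because the sketch counts every pair within the neighbourhood, very dense detections could contribute more than one pair per point.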



Repeatability Results. Using the repeatability criterion, we investigated the
robustness of the two salient region detectors to image rotation and scaling. The
rotation and scaling were performed digitally, using bilinear interpolation. As
a baseline, we also calculated the repeatability of the well-known Harris corner
detector (using a [-2 -1 0 1 2] kernel), and of an improved version of the Harris
detector that calculates the derivatives more precisely by replacing the
[-2 -1 0 1 2] kernel with one calculated from the derivatives of a Gaussian
($\sigma = 1.0$).

Figure 4(a) illustrates the results of repeatability against rotation angle, av- 
eraged over all of the images in the dataset, and Figure 4(b) illustrates the 
variation in repeatability over a range of image scales, again averaged over all 
the images in the dataset. The results show that the salient regions detected by 
finding peaks in the difference-of-Gaussian pyramid are by far the most stable 
to both rotation and scaling. The Scale-Saliency algorithm performs more or less
on a par with the Harris detector. Unfortunately, whilst the Scale-Saliency
algorithm should be robust to both scaling and rotation, in practice it is affected
by discretisation of the digital raster, especially at small scales. Also, we have
found that the clustering part of the Scale-Saliency algorithm does little to help
its stability.



3 Query by Image Content Using Salient Regions 

In previous work by Sebe et al. [5], the use of salient point detectors for
content-based retrieval was shown to give better performance than the use of global
descriptors. In this section we describe a new metric for measuring the
performance of content-based retrieval based on salient points, and illustrate it with
some preliminary results that show that the performance when using salient
regions is indeed better than when using global descriptors.

Fig. 4. Repeatability rate for image rotation (a), and for scale change (b). $\epsilon = 1.5$ in
both cases

In order to facilitate the testing of the use of salient regions for
content-based retrieval, we have developed a system that returns the N closest matches
to a given query image. The system enables queries to be made using either
global descriptors or a descriptor based on salient regions. Following Sebe et
al., we fix the number of salient regions to 50 per image. In the case of global
descriptors, the distance between two images, $I_1$ and $I_2$, is given by the
Euclidean distance between the feature descriptors, $F_1$ and $F_2$:



$$D_E(F_1, F_2) = \lVert F_1 - F_2 \rVert = \sqrt{\sum_{i=1}^{K} (F_{1,i} - F_{2,i})^2} \qquad (2)$$

where $K$ is the number of elements in the feature descriptors. In the case of
matching using salient regions, the distance between two images is given by
summing, over the feature vectors in the first image, the distance from each to
its closest matching feature vector in the second image. Denoting the sets of
$M$ feature vectors in images $I_1$ and $I_2$ as $\{F_1\}$ and $\{F_2\}$:



$$D_{salient}(\{F_1\}, \{F_2\}) = \sum_{j=1}^{M} \min_k \left( D_E(\{F_1\}_j, \{F_2\}_k) \right) \qquad (3)$$

where $\{F_1\}_j$ refers to the $j$th feature vector of image $I_1$ and $\{F_2\}_k$ refers to the
$k$th feature vector of image $I_2$.
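The two distances can be written compactly in Python. This is a minimal sketch that follows equations (2) and (3) as described in the text, assuming that each feature vector is a sequence of numbers and each image is a list of such vectors; the function names are illustrative.

```python
import math


def euclidean(f1, f2):
    """Equation (2): Euclidean distance between two K-element descriptors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(f1, f2)))


def salient_distance(features1, features2):
    """Equation (3): for every feature vector of the first image, add the
    distance to its closest feature vector in the second image."""
    return sum(min(euclidean(f1, f2) for f2 in features2) for f1 in features1)


# Two toy images with two and three 2-element feature vectors respectively.
image1 = [(0.0, 0.0), (3.0, 4.0)]
image2 = [(0.0, 1.0), (3.0, 0.0), (10.0, 10.0)]
forward = salient_distance(image1, image2)
backward = salient_distance(image2, image1)
```

Here `forward` is 1 + 4 = 5, whereas `backward` is larger because the third vector of the second image has no close counterpart. The measure as defined is therefore not symmetric, so the order of query and candidate matters.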



3.1 Semantic Relevance 

The problem with global descriptors is that they cannot fully describe all parts
of an image having different characteristics. The use of salient regions aims to
avoid this problem by developing descriptors that do capture the characteristics
of each part of the image. Given this aim, it should not be unreasonable to
expect that an image description generated from salient regions will be better
than an image described wholly by a global descriptor. In order to test this we
have developed a metric that uses semantically marked images as ground-truth
against the results from our retrieval system.

Table 1. Averaged Semantic Relevance for queries based on the Rank 1 result image
and the closest 5 result images

                      Rank 1 Result Image     Averaged Top 5 Result Images
Feature Type          DoG Peaks    Global     DoG Peaks    Global
RGB Histogram         42.1%        37.6%      51.0%        45.6%
HSI Histogram         45.2%        36.9%      50.4%        49.6%
Mono Histogram        31.6%        36.9%      42.3%        45.0%
Hu Moment             41.1%        22.6%      52.4%        39.5%
RGB Colour Moment     33.7%        24.1%      41.9%        35.4%
HSI Colour Moment     34.9%        30.2%      43.5%        40.5%

The University of Washington Ground Truth Dataset [18] contains a large 
number of images that have been semantically marked up. For example an image 
may have a number of labels describing the image content, such as “trees”, 
“bushes”, “clear sky”, etc. Given a query image with a set of labels, we should 
expect that the images returned by the retrieval system should have the same 
labels as the query image. Let $A$ be the set of all labels from the query image,
and $B$ be the set of labels from a returned image. We then define the semantic
relevance, $R$, of the query to be:

$$R = \frac{|A \cap B|}{|A|} \qquad (4)$$



This implies that if all the labels in set A exist in set B then the semantic 
relevance will be 100%, and if only half of the labels in set A exist in set B then 
the semantic relevance will be 50%. 
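As a small illustration of equation (4), the sketch below computes the semantic relevance for one returned image and the average over several returned images. The function names, the handling of an unlabelled query and the example labels are assumptions for the purpose of the example.

```python
def semantic_relevance(query_labels, result_labels):
    """Equation (4): fraction of the query's labels found in the result."""
    a, b = set(query_labels), set(result_labels)
    if not a:
        return 0.0  # assumption: an unlabelled query scores zero
    return len(a & b) / len(a)


def averaged_relevance(query_labels, ranked_results, top_n=5):
    """Mean semantic relevance over the top_n closest result images."""
    top = ranked_results[:top_n]
    return sum(semantic_relevance(query_labels, r) for r in top) / len(top)


query = {"trees", "bushes", "clear sky", "water"}
results = [
    {"trees", "water", "people"},                   # shares 2 of 4 labels
    {"trees", "bushes", "clear sky", "water"},      # shares all 4 labels
    {"building"},                                   # shares none
]
rank_one = semantic_relevance(query, results[0])
average = averaged_relevance(query, results)
```

The rank one result shares half of the query's labels, so its relevance is 50%, and the average over the three results is (0.5 + 1.0 + 0.0) / 3 = 0.5. Labels that appear only in the returned image do not lower the score, since the denominator depends on the query alone.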



3.2 Results 

We used all of the semantically marked images from the Washington dataset
to form our test set. Taking each image in the test set in turn as a query, we
calculated the distance to each of the other images in the test set using a range
of feature types. We then calculated the semantic relevance for the rank one
image (the closest image, not counting the query image), and we also calculated
the averaged semantic relevance over the closest 5 images. The results are
shown in Table 1. The table shows that the use of salient regions produces better
semantic relevance than global descriptors for every feature type except the
monochrome histogram, although we believe that there is still scope for
improvement of the semantic relevance from the salient regions. We believe that
using a single feature type to describe a salient region (or indeed the whole
image) is not sufficient. For example, the RGB histogram that represents a
“blue sky” semantic label may be very similar to the histogram representing the
“water” label. In our future work we hope to show it is possible to improve the
semantic relevance of queries using salient regions by fusing multiple feature
descriptors. Figure 5 illustrates the differences between a query based on a
global RGB-Histogram descriptor and one based on multiple RGB-Histogram
descriptors built around salient regions found from the peaks in the
difference-of-Gaussian pyramid.

Fig. 5. Example Retrieval: (a) shows the results of a query using the Difference of
Gaussian salient region method, and (b) shows the results of the same query with the
Global method. In both cases, RGB Histograms are used as the feature descriptor and
the first image shown is the query image



4 Conclusions and Future Work 

In this paper, we have illustrated the concept of using peaks in a difference-of- 
Gaussian pyramid to select scale-invariant salient regions. We have shown that 
peaks in the difference-of-Gaussian pyramid are robust to a range of transfor- 
mations, and that they perform better than an alternative approach to finding 
salient regions based on image entropy. 

We have also demonstrated the concept of using salient regions for content- 
based retrieval. We have introduced a new metric, which we have termed se- 
mantic relevance, for the measurement of the relevance of a semantically marked 
result image from a semantically marked query image. 

Our results have shown that the use of salient regions for content-based 
retrieval produces better semantic relevance than global descriptors. However, 
we note that it should be possible to improve these results even more by the use 
of better feature descriptors. 

As previously mentioned, our future plans are to use the fusion of multiple 
features to try and improve the semantic relevance. We also plan to extend our 
system to use a better distance metric, such as the Mahalanobis distance, Dm-

Abstract. We have carried out a detailed evaluation of the use of texture
features in a query-by-example approach to image retrieval. We
used three radically different texture feature types motivated by i) statistical,
ii) psychological and iii) signal processing points of view. The features
were evaluated and tested on retrieval tasks from the Corel and
TRECVID2003 image collections. For the latter we also looked at the
effects of combining texture features with a colour feature.



1 Introduction 

Texture is a key component of human visual perception. Like colour, this makes
it an essential feature to consider when querying image databases. Everyone can
recognise texture, but it is more difficult to define. Unlike colour, texture occurs
over a region rather than at a point. It is normally defined purely by grey levels
and as such is orthogonal to colour. Texture has qualities such as periodicity
and scale; it can be described in terms of direction, coarseness, contrast and so
on [1]. It is this that makes texture a particularly interesting facet of images
and results in a plethora of ways of extracting texture features. To enable us to
explore a wide range of these methods we chose three very different approaches
to computing texture features: the first takes a statistical approach in the form
of co-occurrence matrices, the second the psychological view of Tamura’s
features, and the third signal processing with Gabor wavelets.

Our study is the first to focus an evaluation of texture features on the whole 
image, and to tailor features for optimum retrieval performance in this context. 
The majority of original papers devising or evaluating texture features used 
classification or segmentation tasks to measure performance [2, 3, 4, 5]. Both of 
these tasks are significantly different to the problems faced in image retrieval 
where one looks at generic queries for an entire picture. Real pictures are made 
up of a patchwork of differing textures rather than the uniform texture images 
often used in studies, such as the ones taken from Brodatz’s photo book [6]. 
To that effect we suggest encoding texture in terms of joint histograms of low
dimensional texture characteristics over the image, in the same way that 3D colour
histograms are computed; we have called this a Tamura image. Throughout our
work we have considered how best to cope with varying image sizes, scales,
formats and orientations.



P. Enser et al. (Eds.): CIVR 2004, LNCS 3115, pp. 326-334, 2004. 
(c) Springer-Verlag Berlin Heidelberg 2004

In the next section we look at the features we have chosen and how they
are computed. Sect. 3 then describes the image libraries and similarity measures
we used for evaluation. Sect. 4 presents our initial results on a training set and
suggests modifications and parameters that we found gave the best retrieval
performance. A larger performance comparison is carried out on the TRECVID2003
data set. Finally, Sect. 5 concludes the paper and outlines further work.

2 Texture Features 

2.1 Co-occurrence 

Statistical features of grey levels were one of the earliest methods used to clas- 
sify textures. Haralick [7] suggested the use of grey level co-occurrence matrices 
(GLCM) to extract second order statistics from an image. GLCMs have been 
used very successfully for texture classification in evaluations [2]. 



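Table 1. Features calculated from the normalised co-occurrence matrix P(i,j)

Feature        Formula
Energy         $\sum_i \sum_j P(i,j)^2$
Entropy        $-\sum_i \sum_j P(i,j) \log P(i,j)$
Contrast       $\sum_i \sum_j (i-j)^2 P(i,j)$
Homogeneity    $\sum_i \sum_j \frac{P(i,j)}{1 + |i-j|}$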



Haralick defined the GLCM as a matrix of frequencies at which two pixels, 
separated by a certain vector, occur in the image. The distribution in the matrix 
will depend on the angular and distance relationship between pixels. Varying 
the vector used allows the capturing of different texture characteristics. Once 
the GLCM has been created, various features can be computed from it. These 
have been classified into four groups: visual texture characteristics, statistics, 
information theory and information measures of correlation [7,3]. We chose the 
four most commonly used features, listed in Table 1, for our evaluation. 
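The construction of a GLCM and the four features can be sketched as follows. This is a minimal illustration rather than the implementation evaluated here: it assumes an image stored as a list of rows of integer grey levels, a single non-symmetric displacement vector `(dx, dy)`, and the common textbook forms of the four features (including a base-2 logarithm for the entropy), which may differ in detail from those used in the experiments.

```python
import math


def glcm(image, dx, dy, levels):
    """Normalised co-occurrence matrix P(i, j) for the displacement (dx, dy)."""
    counts = [[0] * levels for _ in range(levels)]
    height, width = len(image), len(image[0])
    for y in range(height):
        for x in range(width):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < width and 0 <= y2 < height:
                counts[image[y][x]][image[y2][x2]] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]


def glcm_features(p):
    """Energy, entropy, contrast and homogeneity of a normalised GLCM."""
    cells = [(i, j, v) for i, row in enumerate(p) for j, v in enumerate(row)]
    return {
        "energy": sum(v * v for _, _, v in cells),
        "entropy": -sum(v * math.log2(v) for _, _, v in cells if v > 0),
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }


# A 4x4 image with four grey levels, using a horizontal offset of one pixel.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
p = glcm(image, 1, 0, 4)
features = glcm_features(p)
```

For this image there are twelve horizontal pixel pairs, which gives an energy of 1/6 and a contrast of 7/12. Changing `(dx, dy)` changes the angle and distance being probed, which is how different texture characteristics are captured.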

2.2 Tamura 

Tamura et al. took the approach of devising texture features that correspond to
human visual perception [1]. They defined six textural features (coarseness, con- 
trast, directionality, line-likeness, regularity and roughness) and compared them 
with psychological measurements for human subjects. The first three attained 
very successful results and are used in our evaluation, both separately and as 
joint values. 

Coarseness has a direct relationship to scale and repetition rates and was
seen by Tamura et al. as the most fundamental texture feature. An image will
contain textures at several scales; coarseness aims to identify the largest size at
which a texture exists, even where a smaller micro texture exists. Computationally
one first takes averages at every point over neighbourhoods whose linear sizes
are powers of 2. The average over the neighbourhood of size $2^k \times 2^k$ at
the point $(x, y)$ is

$$A_k(x,y) = \sum_{i=x-2^{k-1}}^{x+2^{k-1}-1} \; \sum_{j=y-2^{k-1}}^{y+2^{k-1}-1} f(i,j) / 2^{2k}.$$

Then at each point one takes differences between pairs of averages corresponding
to non-overlapping neighbourhoods on opposite sides of the point in both
horizontal and vertical orientations. In the horizontal case this is

$$E_{k,h}(x,y) = |A_k(x + 2^{k-1}, y) - A_k(x - 2^{k-1}, y)|.$$

At each point, one then picks the best size, which gives the highest output value,
where $k$ maximizes $E$ in either direction. The coarseness measure is then the
average of $S_{opt}(x,y) = 2^{k_{opt}}$ over the picture.

Contrast aims to capture the dynamic range of grey levels in an image,
together with the polarisation of the distribution of black and white. The first is
measured using the standard deviation of grey levels and the second using the
kurtosis $\alpha_4$. The contrast measure is therefore defined as

$$F_{con} = \sigma / (\alpha_4)^n \quad \text{where} \quad \alpha_4 = \mu_4 / \sigma^4,$$

$\mu_4$ is the fourth moment about the mean and $\sigma^2$ is the variance. Experimentally,
Tamura found $n = 1/4$ to give the closest agreement to human measurements.
This is the value we used in our experiments.
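The contrast measure is simple enough to compute directly from a list of grey levels. The sketch below follows the formula as described in the text with n = 1/4; the function name and the choice to return zero for a perfectly uniform region (where the kurtosis is undefined) are assumptions.

```python
import math


def tamura_contrast(grey_levels, n=0.25):
    """F_con = sigma / alpha_4^n, with alpha_4 = mu_4 / sigma^4."""
    count = len(grey_levels)
    mean = sum(grey_levels) / count
    variance = sum((g - mean) ** 2 for g in grey_levels) / count
    if variance == 0:
        return 0.0  # assumption: a uniform region has no contrast
    mu4 = sum((g - mean) ** 4 for g in grey_levels) / count
    alpha4 = mu4 / variance ** 2
    return math.sqrt(variance) / alpha4 ** n


# Fully polarised black and white versus a narrow band of mid greys.
high = tamura_contrast([0, 255] * 8)
low = tamura_contrast([120, 130] * 8)
```

For a distribution split evenly between two grey levels the kurtosis is 1, so the measure reduces to the standard deviation: 127.5 for the black and white region and 5 for the mid-grey region.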

Directionality is a global property over a region. The feature described does 
not aim to differentiate between different orientations or patterns, but measures 
the total degree of directionality. Two simple masks are used to detect edges in 
the image. At each pixel the angle and magnitude are calculated. A histogram, 
$H_d$, of edge probabilities is then built up by counting all points with magnitude
greater than a threshold and quantising by the edge angle. The histogram will
reflect the degree of directionality. To extract a measure from $H_d$ the sharpness
of the peaks is computed from their second moments.

Tamura Image is a notion where we calculate a value for the three features
at each pixel and treat these as a spatial joint coarseness-contrast-directionality
(CND) distribution, in the same way as images can be viewed as spatial joint
RGB distributions. We extract colour-histogram-style features from the Tamura
CND image, both marginal and 3D histograms. The regional nature of texture
meant that the values at each pixel were computed over a window. A similar 3D
histogram feature is used by MARS [8].

2.3 Gabor

One of the most popular signal processing based approaches for texture feature 
extraction has been the use of Gabor filters. These enable filtering in the fre- 
quency and spatial domain. It has been proposed that Gabor filters can be used 
to model the responses of the human visual system. Turner [9] first implemented 
this by using a bank of Gabor filters to analyse texture. A bank of filters at dif- 
ferent scales and orientations allows multichannel filtering of an image to extract 
frequency and orientation information. This can then be used to decompose the 
image into texture features. 

Our implementation is based on that of Manjunatlr et al [10,11]. The feature 
is computed by filtering the image with a bank of orientation and scale sensitive 
filters and computing the mean and standard deviation of the output in the 
frequency domain. 

Filtering an image I(x, y) with Gabor filters g_{mn} designed according to [10] 
results in its Gabor wavelet transform: 

W_{mn}(x, y) = \int I(x_1, y_1) \, g^{*}_{mn}(x - x_1, y - y_1) \, dx_1 \, dy_1 

The mean and standard deviation of the magnitude |W_{mn}| are used to form the 
feature vector. The outputs of filters at different scales will be over differing 
ranges. For this reason each element of the feature vector is normalised using 
the standard deviation of that element across the entire database. 
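
The per-element normalisation described above can be sketched as follows. This is a minimal illustration only: the function name `normalise_by_std` and the use of the population standard deviation are assumptions, not details taken from the implementation in [10,11].

```python
from statistics import pstdev


def normalise_by_std(vectors):
    """Divide each feature element by its standard deviation over the database.

    `vectors` holds one feature vector per image (hypothetical layout).
    Elements whose standard deviation is zero are mapped to 0.0.
    """
    stds = [pstdev(column) for column in zip(*vectors)]
    return [
        [x / s if s > 0 else 0.0 for x, s in zip(vector, stds)]
        for vector in vectors
    ]


if __name__ == "__main__":
    database = [[1, 10], [3, 30]]
    print(normalise_by_std(database))
```

After this step the outputs of filters at different scales contribute on a comparable range when distances are computed.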



3 Experimental Set Up 

We followed a two-stage approach. Initial evaluation and modifications to the 
features were tested using a carefully selected subset of the Corel image library 
and the vector space similarity measure. We then ran larger tests on the 
TRECVID2003 data using the k-nearest neighbour measure (k-nn). We have a 
baseline for evaluation from previous work with the TREC dataset, for which 
k-nn has consistently proved the best retrieval method. 

Image Collections. We selected 6,192 images from the Corel collection 
to give 63 categories that were visually similar internally, but different from 
each other [12]. A set of 630 single-image category queries was executed to test 
performance across all categories. Relevance judgments on the retrieved images 
were based on the categorisation. The results shown in Section 4 are the mean 
average precision (m.a.p.). 
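
For readers unfamiliar with the measure, mean average precision can be sketched as below. This is a generic, textbook-style formulation; the function names and the boolean relevance-list input are assumptions for illustration and are not taken from the evaluation software used in the paper.

```python
def average_precision(ranked_relevance, n_relevant):
    """Average precision for one query.

    ranked_relevance: booleans in rank order, True where the retrieved
                      image is relevant.
    n_relevant:       number of relevant images in the collection.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this relevant rank
    return total / n_relevant if n_relevant else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precisions.

    queries: list of (ranked_relevance, n_relevant) pairs.
    """
    return sum(average_precision(r, n) for r, n in queries) / len(queries)


if __name__ == "__main__":
    print(mean_average_precision([([True, False, True], 2), ([False, True], 1)]))
```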

A second, larger image collection was used to give a more realistic performance 
comparison. This comprised 32,318 key-frames from the TRECVID2003 
collection [13]. The search task specified for TRECVID2003 consisted of 25 topics; 
for each topic a few example images were given as a query. The published 
relevance judgments for these topics were used to evaluate the retrieval performance 
for different features and combinations of features. 

Similarity Measures. Distances between feature vectors were calculated 
using the Manhattan metric. The resultant distances were then median-normalised 
to give even weighting when combined. The plain vector space model 
was used for retrieval on the Corel data set as this involved only simple 1-image 
queries. 
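
A minimal sketch of the Manhattan distance with median normalisation follows. It assumes that "median normalised" means dividing each feature's distances by their median before summing across features; that reading, and all function names, are assumptions made for illustration.

```python
from statistics import median


def manhattan(u, v):
    """Manhattan (city-block) distance between two feature vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def median_normalise(distances):
    """Scale a list of distances so that their median becomes 1."""
    m = median(distances)
    return [d / m for d in distances] if m > 0 else list(distances)


def combine(per_feature_distances):
    """Sum median-normalised distances across features, image by image."""
    normalised = [median_normalise(d) for d in per_feature_distances]
    return [sum(column) for column in zip(*normalised)]


if __name__ == "__main__":
    texture = [1, 2, 4]      # distances from the query under one feature
    colour = [10, 20, 40]    # distances under a second feature
    print(combine([texture, colour]))
```

Without the normalisation the second feature in the example would dominate the combined ranking purely because of its larger numeric range.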

For querying the TREC data a version of the distance-weighted k-nn approach 
was used [14], with k = 40. Positive examples (P) are supplied as the 
query and negative examples (N) are randomly selected from the collection. To 
rank an image i in the collection we identify those images in P and N that are 
amongst the k-nearest neighbours of i. Using these neighbours we determine the 
dissimilarity: 

\frac{\sum_{n \in N} d^{-1}(i, n)}{\sum_{p \in P} d^{-1}(i, p)} 
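
The ranking rule can be sketched as below, under stated assumptions: the dissimilarity is taken to be the ratio of summed inverse distances to negative neighbours over summed inverse distances to positive neighbours, and a small constant `eps` is added to avoid division by zero. The constant and the function name are illustrative choices, not part of the method in [14].

```python
def knn_dissimilarity(dist_to_pos, dist_to_neg, k=40, eps=1e-9):
    """Distance-weighted k-nn dissimilarity of one collection image.

    dist_to_pos: distances from the image to each positive example.
    dist_to_neg: distances from the image to each negative example.
    Only the k nearest labelled examples contribute. eps is an assumed
    smoothing constant that keeps the ratio finite.
    """
    labelled = [(d, True) for d in dist_to_pos]
    labelled += [(d, False) for d in dist_to_neg]
    nearest = sorted(labelled, key=lambda item: item[0])[:k]
    pos = sum(1.0 / (d + eps) for d, is_pos in nearest if is_pos)
    neg = sum(1.0 / (d + eps) for d, is_pos in nearest if not is_pos)
    return (neg + eps) / (pos + eps)


if __name__ == "__main__":
    # Close to a positive example, far from a negative one: low dissimilarity.
    print(knn_dissimilarity([0.1], [10.0], k=2))
```

Images are then ranked by increasing dissimilarity, so those surrounded by positive examples come first.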

4 Evaluation and Results 

For each feature we evaluated performance in the configuration described in 
Sect. 2. Ideas to improve performance were devised and evaluated. The general 
themes considered were how best to represent an entire image, how to accom- 
modate differing sizes and scale of images and how to cope with the regional 
qualities of textures. These evaluations were run on the Corel data. Paired t-tests 
were carried out to check whether results were statistically significant at 
α = 0.05. 

The best performing features from the initial evaluation were then tested on 
the TRECVID2003 data set. Tests were run with each texture feature combined 
with a high performing colour feature. 

4.1 Co-occurrence 

The two main variables when creating a GLCM are the number of quantisation 
levels and the displacement vector. We decided to use four vector angles (0°, 45°, 
90° and 135°) and four distances. This could be used to calculate up to sixteen 
GLCMs. However, as the statistics are not invariant under rotation we also tried 
summing the four angles at each distance into a single matrix. GLCMs can be made 
symmetrical by including the reverse vector; symmetric and asymmetric matrices 
were tested. The number of quantisation levels dictates the size and density of the 
matrix. This may become a problem with small images or tiles. The effect of 
varying quantisation between 4 and 64 levels was tried. Features were calculated 
for whole and tiled images. 
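
To make the construction concrete, the sketch below builds a normalised GLCM for one displacement vector and computes a homogeneity statistic. The homogeneity formula used here, the sum of p(i, j) / (1 + |i - j|), is one common definition and is an assumption; the function names and the list-of-lists image layout are likewise illustrative.

```python
def glcm(image, dx, dy, levels, symmetric=False):
    """Normalised grey-level co-occurrence matrix for displacement (dx, dy).

    image:  list of rows of integer grey levels in range(levels),
            i.e. already quantised.
    symmetric: also count the reverse vector, as described in the text.
    """
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                a, b = image[y][x], image[y2][x2]
                counts[a][b] += 1
                if symmetric:
                    counts[b][a] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[c / total for c in row] for row in counts]


def homogeneity(p):
    """Assumed definition: sum of p[i][j] / (1 + |i - j|)."""
    n = len(p)
    return sum(p[i][j] / (1 + abs(i - j)) for i in range(n) for j in range(n))


if __name__ == "__main__":
    tile = [[0, 0], [1, 1]]
    print(homogeneity(glcm(tile, dx=1, dy=0, levels=2)))
```

With few quantisation levels the matrix is small and dense; with many levels and a small tile most cells stay empty, which is the sparsity problem noted above.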

Preliminary results showed that distances between 1 and 4 pixels gave the 
best performance. There was no significant difference between symmetrical and 
asymmetric matrices. Tiling of the image gave a large increase in retrieval 
performance, which flattened out by 9x9 tiles. The results in Table 2 are for 7x7 
tiles. Similarly, increasing quantisation improves performance. The concatenated features (cat) 
gave better results at all points than the rotationally invariant summed matrices 
(sum). The best feature was homogeneity with a m.a.p. of 12.2%. 

Table 2. Co-occurrence features: mean average precision retrieval 

                      Quantisation levels
Feature               4        8        16       32       64
Energy: cat           7.63%    8.09%    9.30%    9.85%    9.54%
Energy: sum           7.04%    7.79%    8.85%    9.19%    8.96%
Entropy: cat          8.12%    9.22%    10.41%   11.09%   11.36%
Entropy: sum          7.54%    8.76%    9.79%    10.37%   10.70%
Contrast: cat         8.46%    8.51%    8.35%    8.29%    8.28%
Contrast: sum         7.83%    7.85%    7.65%    7.59%    7.57%
Homogeneity: cat      9.17%    10.18%   11.16%   11.83%   12.19%
Homogeneity: sum      8.50%    9.52%    10.39%   10.93%   11.26%

4.2 Tamura 

When calculating standard Tamura features for whole or tiled images the main 
variable is the k value for coarseness. The effect of varying this, and the number 
of tiles, can be seen in Table 3. The dashes in the table are where the image size 
resulting from tiling meant that the k value was too large to be used because of 
the border needed. 

With the histogram features the main variable to evaluate was the window 
size. Coarseness can be calculated at a pixel level. However, both the direction- 
ality and contrast features operate over a region. A large window would smear 
the feature and lose resolution; conversely a small window may invalidate the 
statistical features, particularly if the directionality histogram is too sparsely 
populated. To evaluate this the features were run over several window sizes, 
creating a histogram for each feature. 

A little surprisingly, initial results showed that increasing the k value for 
coarseness reduced the performance; the optimum value was 2. This may be 
due to the large borders necessary for higher values of k. However, it is more likely 
caused by the nature of textures in images and the way the algorithm averages 
the 2^k values. There are unlikely to be textures with a coarseness of 64 or 32 
pixels in a normal image. The algorithm may still detect noise at this dimension, 
biasing the average value of the feature. A change to the algorithm was made so 
that it took the values of k rather than 2^k, effectively introducing a logarithmic 
scaling of the coarseness and giving less influence to the larger scales. This gave 
a significant increase in performance for the histogram, from 6.1% to 10.1%, but 
no improvement when applied to the standard feature. 

Performance of the directionality feature was poor. A detailed look at the 
operation of the algorithm showed that this was largely due to the sparse pop- 
ulation of the histogram and subsequent difficulty in calculating valid variance 
of its peaks. Several options for improvement were tried including calculating 
global variance of the histogram and using entropy. The latter gave a substan- 
tial improvement, from 6.6% to 9.7%, for the standard feature but negligible 
effect on the histogram. 
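
The entropy alternative can be sketched as follows. This is a minimal illustration that assumes the measure is the Shannon entropy (base 2) of the edge-angle histogram after normalising the counts to probabilities; the exact form used in the experiments is not given in the text.

```python
import math


def histogram_entropy(hist):
    """Shannon entropy (in bits) of a histogram of edge-angle counts.

    A histogram concentrated in one bin (strongly directional texture)
    gives 0; a flat histogram (no dominant direction) gives log2(bins).
    """
    total = sum(hist)
    if total == 0:
        return 0.0
    return -sum((h / total) * math.log2(h / total) for h in hist if h > 0)


if __name__ == "__main__":
    print(histogram_entropy([4, 0, 0, 0]))  # single dominant direction
    print(histogram_entropy([1, 1, 1, 1]))  # no dominant direction
```

Unlike peak finding, this value is defined for any non-empty histogram, which may be why it behaves better when the histogram is sparsely populated.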

Table 3. Tamura features: mean average precision retrieval 

                               Standard features (tiling)                 Histogram features (window size)
Feature                        1x1      3x3      5x5      7x7      9x9      2        4        8        16
Contrast                       3.24%    6.08%    7.20%    8.07%    8.03%    5.96%    6.71%    7.01%    6.92%
Directionality: peak finding   2.91%    4.16%    5.02%    5.79%    6.64%    5.39%    5.59%    5.57%    4.93%
Directionality: entropy        2.74%    5.35%    7.45%    8.93%    9.73%    4.89%    4.37%    5.24%    5.43%
Coarseness-2: 2^k              4.42%    8.33%    9.48%    9.87%    9.91%    6.90%    5.99%    6.09%    6.01%
Coarseness-3: 2^k              3.54%    7.57%    8.79%    9.19%    9.02%    6.52%    5.85%    5.96%    5.83%
Coarseness-4: 2^k              3.49%    7.16%    7.68%    6.98%    -        6.12%    5.71%    5.64%    5.40%
Coarseness-5: 2^k              3.25%    5.74%    -        -        -        -        -        -        -
Coarseness-6: 2^k              2.92%    -        -        -        -        -        -        -        -
Coarseness-2: k                4.43%    7.96%    9.32%    9.57%    9.59%    6.44%    9.98%    9.83%    8.22%
Coarseness-3: k                3.91%    7.50%    8.92%    9.10%    8.94%    5.68%    10.08%   9.24%    7.93%
Coarseness-4: k                3.41%    6.95%    7.74%    7.15%    -        8.81%    9.33%    8.12%    7.67%

Finally the combined marginal and 3D histograms were evaluated using a 
window size of 8, k of 3 and entropy directionality. In addition a combined 
feature vector of the 3 standard features was evaluated. The m.a.p. results were: 
marginal histogram 12.0%, 3D histogram 13.7% and standard 14.3%. All gave a 
significant improvement over the single features. 

4.3 Gabor 

Sect. 2.3 describes the generation of this feature. However, there still remain 
questions over how to apply it to a heterogeneous set of images. The problems 
of scale, varying size and so on apply. The evaluation in [10] was applied to fixed 
tiles extracted from the Brodatz album. In [11] the feature was used successfully 
with aerial photographs split into a large number of fixed size tiles and then 
querying to find individual tiles. We decided to evaluate the feature in two con- 
figurations across a range of scale and orientation values. The first scaled the 
filter dictionary to the size of the image. This should scale the response so that 
the same image of different size gives a similar value. The second approach was 
to use a fixed size filter and apply this to a sliding window over the image. 

Initial results showed that scaling the filter size gave much superior results to 
the sliding window approach. Tiling increased performance in a similar manner 
to the other features. The results shown in Table 4 are for 7 x 7 tiling. The 
best performance is obtained from just 2 scales and 4 orientations. This was 
unexpected as most literature recommends 4 scales and 6 orientations. Looking 
at the filtered images indicated that, as for Tamura, this may be due to noise at 
coarser scales. 

Table 4. Gabor wavelets: mean average precision retrieval 

              Orientations
Scales        3        4        6
2             13.1%    14.0%    13.9%
3             11.0%    11.4%    11.3%
4             10.8%    11.4%    11.2%

4.4 Evaluation Using TRECVID2003 Video Data 

A range of the best performing features were run on the TRECVID2003 data and 
evaluated using the published relevance judgments. The queries were run singly 
and then combined with a colour histogram feature, HSV [12]. The results are 
shown in Table 5. For comparison some features used for previous evaluations 
[12] gave m.a.p. values of: HSV 1.9%, convolution 2.2% and variance 1.7%; random 
retrieval would give 0.26%. 

In this evaluation the texture features performed extremely well in comparison 
with previous benchmarks. Gabor gave the best results, 3.9% or 15 times 
better than random retrieval. Of the Tamura features the best performing was 
the combined standard features. The top 3 performing texture features combined 
gave a m.a.p. of 4.22%. 

Combining with the HSV feature improved average retrieval performance in 
all cases, but at an individual query level the benefits were both positive and 
negative. It is interesting that using simple combination of features gives varying 
degrees of improvement; being able to choose the optimum combination based 
on the query would be beneficial. 



Table 5. TREC evaluation: mean average precision retrieval 

Feature                       Single    Combined with HSV
gabor-2-4                     3.93%     4.31%
co-occurrence homogeneity     2.85%     3.03%
tamura standard all           2.57%     3.43%
tamura CND                    1.65%     2.72%
tamura coarseness-2           0.97%     2.49%

5 Conclusions 

We selected 3 different texture features, implemented and evaluated them. Both 
the evaluation and implementation focussed on query-by-example image retrieval 
rather than the usual classification task. 

This led to some novel modifications to the Tamura features. We found that 
looking for large scale coarseness degraded performance, so we limited the range 
and used a logarithmic scale. An improvement in directionality performance over 
small window sizes was achieved by using an entropy measure rather than taking 
the second moments of the peaks. We also encoded the features in terms of joint 
histograms; the overall performance of these was similar to the standard features. 

To improve the retrieval with Gabor we scaled the filter size to that of the im- 
age, rather than using a fixed size filter. Rather unintuitively we found that fewer 
scales gave higher retrieval rates. Our tests of co-occurrence matrices showed a 
solid performance — as expected! 

Our evaluation with TRECVID2003 data showed that the top 3 texture 
features performed better than previously used colour features. Combination 
with a colour feature boosted retrieval performance in all cases. Overall we have 
demonstrated that we have produced robust texture features for image retrieval. 

We would like to carry out further evaluations on larger data sets, partic- 
ularly investigating the interaction of different feature combinations. Finally, 
texture features have an advantage over colour features in that performance 
should be the same for monochrome images. It would be interesting to perform 
an evaluation on a library of black and white pictures. 

Acknowledgement. This work was partially supported by the EPSRC, UK. 


Abstract. This paper describes a content-based approach to improve image retrieval 
effectiveness. First, we define two new measures for computing similarity 
among images based on color histograms, namely the dissimilitude distance DS* 
and the similarity distance E. The latter is incorporated into the exponentiation 
part of the Gibbs distribution and into the generalized Dirichlet mixture, while 
the former is compared to five similarity measures: L1, L2 (Euclidean distance), 
E, as well as Gibbs and Dirichlet distributions integrating the similarity measure 
E. Then, in order to overcome the limitations (and inappropriateness) of some 
previous information retrieval measures in evaluating the efficiency of an image 
retrieval process, three variants of a new effectiveness measure are proposed and 
tested on an image collection for different similarity distances. 



1 Introduction 

Content-based image retrieval (CBIR) has emerged as an important area in computer 
vision, multimedia computing and databases. In order to make image databases easier 
to explore, we have developed a three-step process and a prototype for image mining 
and retrieval (see [5]). The main objectives of such a work are: (i) the development of 
an image feature extraction module, (ii) the design of a data mining tool dedicated to 
image clustering, classification and association rule generation, and (iii) the design of 
an image retrieval module which allows the identification of images that are similar to a 
given image query. In this paper, we limit ourselves to the third objective by describing 
the mechanisms put together to improve image retrieval effectiveness. 

The rest of the paper is organized as follows. Section 2 presents a brief background 
on image color representation. Section 3 presents a new similarity distance E for image 
retrieval as well as its integration into two separate distributions, namely the Gibbs 
distribution (more precisely, its exponentiation part) and the generalized Dirichlet mixture [1]. A 
new retrieval measure is defined in Section 4, while details about the experimentation of 
our solution to improving image retrieval effectiveness are provided in Section 5. 

2 Color Spaces 

Most image retrieval systems follow the paradigm of representing images using a set of 
features, such as color, texture, shape and layout. Among these features, color is the most 
frequently used visual property in content-based image retrieval because it is relatively 
robust, and invariant with respect to image size and orientation. 

It is known that the RGB space is not perceptually uniform in the sense that color 
differences captured by the Euclidean distance, for example, in the three-dimensional 
RGB space do not correspond to color differences as perceived by humans. The CIE 
(Commission Internationale de l'Éclairage) has therefore defined two perceptually uniform 
or approximately uniform color spaces, L*a*b* and L*u*v*. Further, the L*C*H* 
(Lightness, Chroma, and Hue) and L*t*θ* (t = Chroma and θ* = Hue) color spaces have 
been defined as derivatives of L*u*v* and L*a*b* [4]. 

In our work we use 3-D L*a*b* and L*C*H* color spaces to represent and extract 
color properties of images. Any color in L*a*b* space is represented in a cubic coordinate 
system of axes L*, a*, and b*. The mapping from L*a*b* to L*C*H* can be expressed in 
terms of polar coordinates with the perceived lightness and the psychometric correlates 
of chroma and hue angle using the following formula: 

C^{*}_{ab} = \sqrt{a^{*2} + b^{*2}} \quad \text{and} \quad H^{*}_{ab} = \tan^{-1}\left(\frac{b^{*}}{a^{*}}\right)    (1) 
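
The polar mapping of Equation 1 can be sketched in a few lines. Two details here are assumptions made for the illustration: the quadrant-aware `math.atan2` is used in place of a plain inverse tangent so that a* = 0 is handled, and the hue angle is reported in degrees in the range [0, 360).

```python
import math


def lab_to_lch(l_star, a_star, b_star):
    """Convert an L*a*b* triple to (L*, C*, H*).

    Chroma is the radius in the a*-b* plane and hue is the angle,
    returned in degrees in [0, 360) (an assumed convention).
    """
    chroma = math.hypot(a_star, b_star)
    hue = math.degrees(math.atan2(b_star, a_star)) % 360.0
    return l_star, chroma, hue


if __name__ == "__main__":
    print(lab_to_lch(50.0, 3.0, 4.0))
```

Lightness passes through unchanged, so only the two chromatic axes are re-expressed.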

To get good precision with a reasonable execution time, we apply Wand's 
quantization method [8] to the 3-D color histogram L*C*H*. The hue angle H* 
is divided into 17 colors (k = 0, 1, ..., 16), and each color is then split into 12 chroma 
(n = 0, 1, ..., 11) and 15 lightness values (m = 0, 1, ..., 14). For white and black 
images, color is split into 15 different lightness values. 

Finally, each histogram is divided into 3075 color bins (footnote 1) and represented by a vector 
v = (V_{0,0,0}, V_{0,0,1}, ..., V_{k,m,n}, ..., V_{16,14,11}), where V_{k,m,n} represents a color bin 
and stands for the percentage of pixels having the color, lightness and chroma in the 
quantization intervals kΔh, mΔl and nΔc. 
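
The bin count can be checked with a short sketch. The flat index layout below (hue-major, then lightness, then chroma, with the 15 achromatic lightness bins appended at the end) is a hypothetical choice for illustration; the text fixes only the number of bins per dimension.

```python
HUES = 17        # k = 0..16
LIGHTNESS = 15   # m = 0..14
CHROMA = 12      # n = 0..11
ACHROMATIC = 15  # extra lightness bins for white and black images

TOTAL_BINS = HUES * LIGHTNESS * CHROMA + ACHROMATIC


def bin_index(k, m, n):
    """Flat index of chromatic bin V[k][m][n] (assumed layout)."""
    return (k * LIGHTNESS + m) * CHROMA + n


def achromatic_index(m):
    """Flat index of the m-th achromatic lightness bin (assumed layout)."""
    return HUES * LIGHTNESS * CHROMA + m


if __name__ == "__main__":
    print(TOTAL_BINS, bin_index(16, 14, 11), achromatic_index(14))
```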



3 Similarity Color Distance 

Color-based similarity analysis can be conducted using either color vectors or color 
histograms and bins. Moreover, it can be conducted using either similitude, dissimilitude 
or both of them. 

In the case of similitude analysis, a simple metric distance L_q such as L1 (city-block, 
q = 1) or L2 (Euclidean distance, q = 2) can be used. However, L1 and L2 are not 
appropriate for the identification of dissimilitude. 

L_q = \left( \sum_{c} |C_X(c) - C_Y(c)|^{q} \right)^{1/q}    (2) 

As opposed to the L1 and L2 distances, which compute the difference between two histograms 
w.r.t. color c, a metric called histogram intersection [7] is defined as the common 
proportion of color c in two histograms. 
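
The two families of measure mentioned above can be contrasted with a small sketch. Histograms are assumed to be equal-length lists of bin proportions; the function names are illustrative.

```python
def minkowski(hist_x, hist_y, q=1):
    """L_q distance of Equation 2 between two histograms."""
    return sum(abs(a - b) ** q for a, b in zip(hist_x, hist_y)) ** (1.0 / q)


def histogram_intersection(hist_x, hist_y):
    """Common proportion of each color, summed over all bins."""
    return sum(min(a, b) for a, b in zip(hist_x, hist_y))


if __name__ == "__main__":
    x = [0.5, 0.5]
    y = [1.0, 0.0]
    print(minkowski(x, y, q=1), histogram_intersection(x, y))
```

The L_q distances grow with the difference between bins, whereas the intersection grows with what the two histograms share.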

In this section we define a new similarity distance which takes into account both the 
dissimilitude and the similitude of two images X and Y with respect to a set of colors. We 
first present a dissimilitude distance named DS*, and then a similarity distance called 
E. Both distances are semi-metric ones. Finally, we integrate E into two distributions, 
namely the Gibbs random field model and the generalized Dirichlet mixture. 

Footnote 1: 3075 = (17 x 15 x 12) + 15. 



3.1 Dissimilitude Distance 

We define the dissimilitude distance DS*_c between two images X and Y with respect 
to color c as equal to D_c + L_c, where D_c is an indicator of a potential absence of color 
c in one of the images, and L_c is the difference between the proportions of color c in 
images X and Y. The former helps discard some images that do not share the same set 
of colors with the query image. 

L_c = |V^{X}(c) - V^{Y}(c)| \quad \text{and} \quad D_c = \begin{cases} L_c & \text{if } L_c = \max(V^{X}(c), V^{Y}(c)) \\ 0 & \text{otherwise} \end{cases}    (3) 

where V^X(c) and V^Y(c) are the proportions of color c in histograms related to images 
X and Y respectively. It turns out that the dissimilitude distance DS*_c is equivalent to L_c 
(e.g., L1) when color c is present in both histograms. 

Based on the L*C*H* color histogram, the color c corresponds to a color bin at 
coordinates k, m, and n. Using Equation 3, the dissimilitude distance of a color bin and 
a color histogram can be expressed respectively by: 

DS^{*}_{k,m,n} = \begin{cases} 2 \times L_{k,m,n} & \text{if } L_{k,m,n} = \max(V^{X}_{k,m,n}, V^{Y}_{k,m,n}) \\ L_{k,m,n} & \text{otherwise} \end{cases}    (4) 

DS^{*} = \left( \sum_{k=0}^{16} \sum_{m=0}^{14} \sum_{n=0}^{11} (DS^{*}_{k,m,n})^{q} \right)^{1/q}    (5) 
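
A minimal sketch of the dissimilitude distance follows, assuming the reading of Equation 4 in which the per-bin difference is doubled whenever it equals the larger of the two proportions (that is, when the color is absent from one histogram), and assuming the histogram-level distance aggregates the bins with exponent q. Histograms are treated as flat lists of bin proportions; names are illustrative.

```python
def ds_bin(v_x, v_y):
    """Per-bin dissimilitude: the L1 difference, doubled when the color
    is missing from one of the two histograms."""
    diff = abs(v_x - v_y)
    return 2 * diff if diff == max(v_x, v_y) else diff


def ds_star(hist_x, hist_y, q=1):
    """Histogram-level dissimilitude, aggregated with exponent q."""
    return sum(ds_bin(a, b) ** q for a, b in zip(hist_x, hist_y)) ** (1.0 / q)


if __name__ == "__main__":
    x = [0.5, 0.5, 0.0]
    y = [0.5, 0.0, 0.5]
    l1 = sum(abs(a - b) for a, b in zip(x, y))
    print(l1, ds_star(x, y))  # the penalty doubles the L1 value here
```

In the example the two images share only one of three colors, so every non-zero difference is penalised and DS* is twice the L1 distance.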

Figures 1, 2 and 3 in Section 5 show the image retrieval output when L1, L2 and 
DS* distances are used respectively. The leftmost top image in each figure represents 
the image query while the retrieved images are displayed by the system in a relevance 
ranking sequence (from top to bottom and left to right). One can see that DS* leads to a 
smaller number of false alarms (i.e., number of irrelevant images included in the answer 
set) than L1 and L2. 

3.2 Similarity Distance 

Similarity between two color histograms (related to images X and Y) with respect 
to color c can be expressed by R_c as a ratio between the dissimilitude DS*_c and the 
similitude S_c as follows: 

R_c = \frac{DS^{*}_{c}}{S_c}, \quad \text{where } S_c = \min(V^{X}(c), V^{Y}(c))    (6) 

In order to better highlight similitude and dissimilitude between two images w.r.t. 
color c, we propose a new similarity distance as a combination of the two components 
DS*_{k,m,n} and S_{k,m,n}. The following two formulae express the similarity distance for 
color bins and color histograms respectively: 

E_{k,m,n} = \begin{cases} DS^{*}_{k,m,n} \, (1 + \log(R_{k,m,n})) & \text{if } S_{k,m,n} > DS^{*}_{k,m,n} \\ DS^{*}_{k,m,n} & \text{otherwise} \end{cases}    (7) 

where S_{k,m,n} is given by Equation 11. 

E = \left( \sum_{k=0}^{16} \sum_{m=0}^{14} \sum_{n=0}^{11} (E_{k,m,n})^{q} \right)^{1/q}    (8) 

Figure 4 shows that E leads to a better retrieval output than L1, L2 and DS*. 

To further improve the effectiveness of the image retrieval procedure, we propose to 
incorporate the empirically superior distance measure E into two models that have been 
shown to possess powerful properties in image retrieval systems in the past: Gibbs 
random fields and the Dirichlet mixture. 



3.3 Similarity Distance and Gibbs Distribution 



Gibbs distributions and Gibbs random fields are very popular in Statistical Physics and 
have been successfully used in image processing such as image enhancement, texture 
analysis, and image comparison [6]. 

A Gibbs random field (GRF) can be thought of as a random coloring of points 
on a lattice. It is therefore convenient to represent it mathematically as a family F of 
random variables taking values in a set S and parameterized through each possible color 
configuration f on the lattice (which in our case is the two-dimensional support for the 
images). Usually, as is the case here, such a distribution is defined as follows: 

P(f) = \frac{\exp\left(-\frac{U(f)}{T}\right)}{\sum_{f \in F} \exp\left(-\frac{U(f)}{T}\right)}    (9) 

where the denominator is just a normalizing constant, T (a measure of the entropy of 
the distribution, usually referred to as the temperature) is set to value 1 for simplicity 
and U(f) is the so-called energy function, which in our case will take the form of a sum 
over neighboring configurations to f as defined by a prescribed neighborhood system 
N. 

U(f) = \sum_{C \in C_N} \left( \sum_{c \in C_C(f)} V_C(c) \right)    (10) 

where C_N is the set of clique types generated by the neighborhood system N, C_C(f) 
is the set of instances of the clique type C in the lattice f, and V_C(·) is the potential 
function associated with clique type C [3]. 

Using a similar reasoning as before, we can define the similitude measure between 
a color bin at position k, m, n in the histogram of images X and Y as follows: 

S_{k,m,n} = [...] \min(V^{X}_{k,m,n}, V^{Y}_{k,j,i}) [...], \quad i \in \{n-1, n, n+1\}, \; j \in \{m-1, m, m+1\}    (11) 

Since each histogram is quantized into k colors and each color is split according to 
n chroma values and m lightness values, the similitude of a color bin V^X(k, m, n) in 
the histogram of the query image can be computed with respect to the neighborhood 
(N = 2) of chroma and lightness of the color bin V^Y(k, m, n) related to a target image. 

While dissimilitude is computed using Equation 4, the similarity distance E of color 
k and lightness m for any value of the chroma n (in the two histograms) is calculated 
using the following equation: 

E_{k,m} = U(f) = \begin{cases} [...] & \text{if } S_{k,m} > DS^{*}_{k,m} > 0 \\ e^{-DS^{*}_{k,m}} & \text{otherwise} \end{cases}    (12) 

U(X, Y) = \sum_{k=0}^{16} \sum_{m=0}^{14} E_{k,m}    (13) 

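where S_{k,m} and DS*_{k,m} are given by: 

S_{k,m} = \left( \min\left( \sum_{n=0}^{11} V^{X}_{k,m,n}, \sum_{n=0}^{11} V^{Y}_{k,m,n} \right)^{q} + \sum_{n=0}^{11} (S_{k,m,n})^{q} \right)^{1/q}    (14) 

DS^{*}_{k,m} = \left( \left| \sum_{n=0}^{11} V^{X}_{k,m,n} - \sum_{n=0}^{11} V^{Y}_{k,m,n} \right|^{q} + \sum_{n=0}^{11} (DS^{*}_{k,m,n})^{q} \right)^{1/q}    (15) 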



The motivation behind using an average value of chroma in Equation 12 is 
that the variation of chroma does not lead to an abrupt change in color perception 
while a lightness variation does. 

Equation 12 is used to calculate the similarity between two histograms of images X 
and Y for a given color k and an identified lightness m, while Equation 13 is used to 
compute the distance between those histograms. 

Using the distribution described by Equation 9 and taking Formula 13 as the energy 
function together with a default value of 1 for T, we define the probability that an image 
Y_j (j = 1, 2, ..., J) in the database is similar to a given query image X as: 

\frac{\exp\left(-\sum_{k=0}^{16} \sum_{m=0}^{14} E_{k,m}\right)}{\sum_{j=1}^{J} \exp(-E_j)}    (16) 

where E_j is the similarity distance between the query image X and image Y_j of the 
database. 
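
With T = 1 the probability above is a normalised exponential of the negated energies, which can be sketched as follows. The sketch assumes the energies E_j (one per database image) have already been computed; the function name is illustrative.

```python
import math


def gibbs_similarity(energies):
    """Turn per-image energies E_j into probabilities exp(-E_j) / Z,
    where Z sums exp(-E_j) over all database images (T = 1)."""
    weights = [math.exp(-e) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


if __name__ == "__main__":
    # A smaller similarity distance yields a larger probability.
    print(gibbs_similarity([0.0, 1.0, 3.0]))
```

Ranking by this probability gives the same order as ranking by increasing E_j; the exponential changes only the spacing between scores.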

Figure 5 illustrates image retrieval using similarity distance as an integration of E 
into the exponentiation part of Gibbs distribution. 



3.4 Similarity Distance and Dirichlet Distribution 

Let V = (V_1, V_2, ...) be a vector of positive random color variables and V_i the 
maximal probability value of the i-th color in the two histograms of X and Y (i.e., 
V_i = max(V^X_{k,m}, V^Y_{k,m}) or V_i = max(V^X_k, V^Y_k)) with \sum_i V_i < A and 0 < V_i < 1. 

Based on the generalized Dirichlet mixture [1], the joint density function is given 
by: 

P(V) = \frac{\Gamma(|\alpha|)}{[...] \prod_i \Gamma(\alpha_i)} [...]    (17) 

Vector α = (α_1, α_2, ...) can be perceived as a similarity distance for events 
governed by V_i, and hence α_i can be instantiated to E_{k,m} (see Equation 12) for 
V_i = max(V^X_{k,m}, V^Y_{k,m}), or to E_k = [...] for V_i = max(V^X_k, V^Y_k), where 
V^X_k = \sum_m \sum_n V^X_{k,m,n} and V^Y_k = \sum_m \sum_n V^Y_{k,m,n}. The parameter A is a constant. 

Figure 6 shows that the integration of the similarity distance E into the Dirichlet 
distribution leads to the best retrieval effectiveness among the six alternatives considered. 



4 Image Retrieval Effectiveness 

The performance of an image retrieval system may be analyzed according to its accuracy 
and its efficiency. While the latter is estimated based on execution time and storage 
requirements, the former corresponds to system effectiveness in retrieving the images 
that are the most closely similar to the image query. 

Indicators such as false alarm, false dismissal, precision and recall are commonly 
used for retrieval effectiveness computation. However, they do not really reflect the 
accuracy of the image retrieval system because the ranking of each displayed image is 
generally not taken into account. The normalized recall measure partially overcomes 
this limitation. 

Faloutsos et al. [2] have defined a measure for evaluating the effectiveness of the QBIC 
system. For each image query, the average rank (AVRR) of all relevant retrieved images 
is computed as well as the ideal average rank of relevant images (IAVRR). The formula 
assumes that the system returns all the P relevant images which, in the ideal case 
(IAVRR), occupy the first P positions. This effectiveness measure obviously takes into 
account the ranking of relevant images. However, it ignores the deviation between the 
ideal ranking and the actual ranking of a relevant image. For example, if the system 
returns images in a completely inverse order of the ideal ranking, the following formula 
returns a perfect effectiveness value (= 1). 

Eff = \frac{AVRR}{IAVRR}, \quad \text{where } IAVRR = \sum_{i=1}^{P} \frac{i}{P} \text{ and } AVRR = \sum_{i=1}^{P} \frac{r_i}{P}    (18) 

where P is the total number of relevant images, i = (1, 2, ..., P) is the similarity image 
ranking by human expert judgement and r_i corresponds to the system image ranking (in a 
decreasing relevance order). 
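
The blind spot described above is easy to reproduce. The sketch below implements Equation 18 as written, with ranks counted from 1; the function name and the list-of-ranks input are assumptions for illustration.

```python
def faloutsos_eff(system_ranks):
    """AVRR / IAVRR for one query.

    system_ranks[i-1] is the rank the system gave to the image that the
    expert ranked i-th (ranks start at 1).
    """
    p = len(system_ranks)
    avrr = sum(system_ranks) / p
    iavrr = sum(range(1, p + 1)) / p
    return avrr / iavrr


if __name__ == "__main__":
    print(faloutsos_eff([1, 2, 3]))  # ideal order
    print(faloutsos_eff([3, 2, 1]))  # fully inverted order, same score
    print(faloutsos_eff([1, 2, 6]))  # one relevant image pushed down
```

Because only the sum of the ranks enters the ratio, any permutation of the first P positions scores the same as the ideal ordering.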

In this section we propose a new effectiveness measure which overcomes the limitations 
indicated so far. Let P be the total number of relevant images in the image 
database, R the total number of retrieved images (R ≥ P) and P_R the accuracy ratio 
defined either by P_R = P/R or P_R = 1/(1 + log(R/P)), where 0 < P_R ≤ 1. We define the 
(actual) average rank as AVRR = \sum_{i=1}^{P} i/P + \sum_{i=1}^{P} |i - r_i|/P, while the ideal 
average rank IAVRR is kept unchanged [2]. 

Table 1. Displayed images using L1, L2, DS*, E, E+Exp and E+Dirichlet distances. False alarms 
are in bold. 

Distance         Relevant image ranking (P = 10)                       R
Expert           1    2    3    4    5    6    7    8    9    10       10
L1               1    2    5    4    3    6    31   11   27   12       31
L2               1    2    28   2    10   20   31   5    19   23       31
DS*              1    2    4    6    3    5    26   13   23   12       26
E                1    2    4    6    3    5    24   14   20   10       24
E + Exp          1    2    4    3    5    7    12   10   17   11       17
E + Dirichlet    1    2    4    3    6    5    7    8    13   10       13

Table 2. Retrieval effectiveness computation for L1, L2, DS* (q = 1), E, E+Exp and E+Dirichlet 
distances using seven measures. 

                         Effectiveness retrieval of six different metric distances
Effectiveness method     Expert   L1       L2       DS*      E        E + Exp   E + Dirichlet
Faloutsos                1.0      2.04     2.96     1.89     1.76     1.38      1.09
Kendall                  1.0      0.689    0.533    0.667    0.667    0.778     0.867
Salton                   1.0      0.995    0.991    0.996    0.997    0.998     0.999
Parkaew                  1.0      0.883    0.678    0.865    0.863    0.908     0.927
Eff_ord                  1.0      0.512    0.364    0.545    0.579    0.743     0.873
Eff_sys (a)              1.0      0.167    0.116    0.209    0.241    0.437     0.672
Eff_sys (b)              1.0      0.368    0.242    0.385    0.419    0.604     0.781

In the following we define two variants of our effectiveness measure. The first one, 
called Eff_ord, exploits ranking in a more accurate way than in [2] while the second, 
called Eff_sys, improves the former by taking into account the number R of retrieved 
images needed to display the P relevant ones. The last one can be split into two distinct 
variants depending on the value given to P_R (see above). The second variant is more 
appropriate when R >> P while the first one performs better when R is a small multiple 
of P. 



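Eff_{ord} = \frac{\sum_{i=1}^{P} i}{\sum_{i=1}^{P} i + \sum_{i=1}^{P} |i - r_i|}    (19) 

Eff_{sys} = \frac{P}{R} \times \frac{\sum_{i=1}^{P} i}{\sum_{i=1}^{P} i + \sum_{i=1}^{P} |i - r_i|}    (20a) 

Eff_{sys} = \frac{\sum_{i=1}^{P} i}{\left(1 + \log\left(\frac{R}{P}\right)\right) \left( \sum_{i=1}^{P} i + \sum_{i=1}^{P} |i - r_i| \right)}    (20b) 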
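
The proposed measures can be sketched as below and checked against the E + Exp row of Tables 1 and 2. Two points are assumptions: Eff_sys is read as the accuracy ratio P_R multiplied by Eff_ord, and the logarithm in variant (b) is taken to base 10, a choice made here only because it reproduces the tabulated E + Exp values.

```python
import math


def eff_ord(system_ranks):
    """Ideal rank sum divided by ideal rank sum plus total rank deviation."""
    p = len(system_ranks)
    ideal = sum(range(1, p + 1))
    deviation = sum(abs(i - r) for i, r in enumerate(system_ranks, start=1))
    return ideal / (ideal + deviation)


def eff_sys(system_ranks, retrieved, variant="a"):
    """Eff_ord scaled by the accuracy ratio P_R.

    variant "a": P_R = P / R
    variant "b": P_R = 1 / (1 + log10(R / P))   (base 10 is an assumption)
    """
    p = len(system_ranks)
    if variant == "a":
        ratio = p / retrieved
    else:
        ratio = 1.0 / (1.0 + math.log10(retrieved / p))
    return ratio * eff_ord(system_ranks)


if __name__ == "__main__":
    e_exp = [1, 2, 4, 3, 5, 7, 12, 10, 17, 11]  # E + Exp row of Table 1
    print(round(eff_ord(e_exp), 3))
    print(round(eff_sys(e_exp, 17, "a"), 3))
    print(round(eff_sys(e_exp, 17, "b"), 3))
```

Unlike the ratio of average ranks, these measures fall below 1 as soon as any relevant image is displaced from its ideal position.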



Our preliminary experiments show that for a given recall, the highest (respectively 
the lowest) precision occurs for E + Dirichlet (respectively L2). 

Fig. 1. Image retrieval using L2 distance. 
Fig. 2. Image retrieval using L1 distance. 
Fig. 3. Image retrieval using DS* distance (q=1). 
Fig. 4. Image retrieval using E distance (q=1). 
Fig. 5. Image retrieval using E+Exp distance. 
Fig. 6. Image retrieval using E+Dirichlet distance. 

5 Empirical Analysis 

We have conducted the empirical analysis in two steps: (i) image retrieval using each
of the six distances on a collection of 1069 images, and (ii) effectiveness computation
based on the expert’s ranking of similar images and using seven retrieval effectiveness
measures. For the first step, ten image queries were submitted to the database by four
users (students and faculty members) and the average execution time was computed. For
each similarity measure, the system retrieves the R images needed to display the P (set
to 10) relevant images. Some irrelevant images appear in the answer set (false alarms)
while some relevant images are missed (false dismissals).
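
The quantities used in this step can be illustrated with a minimal Python sketch. It is not the system used in the experiments: the function names, the image identifiers and the fixed answer-set size k are assumptions made for illustration. The first function finds the number R of retrieved images needed to display P relevant ones; the second counts false alarms and false dismissals for an answer set of size k.

```python
def cutoff_for_p_relevant(ranked_ids, relevant_ids, p):
    """Return R, the smallest rank at which p relevant images have been shown.

    Returns None when the ranking contains fewer than p relevant images.
    """
    relevant = set(relevant_ids)
    found = 0
    for position, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            found += 1
            if found == p:
                return position
    return None


def false_alarms_and_dismissals(ranked_ids, relevant_ids, k):
    """Count irrelevant images in the top k and relevant images outside it."""
    relevant = set(relevant_ids)
    answer_set = set(ranked_ids[:k])
    false_alarms = len(answer_set - relevant)
    false_dismissals = len(relevant - answer_set)
    return false_alarms, false_dismissals


# Hypothetical ranking returned for one query, and the expert's relevant set.
ranking = ["a", "x", "b", "y", "c"]
expert_relevant = ["a", "b", "c", "d"]
r_needed = cutoff_for_p_relevant(ranking, expert_relevant, 2)
counts = false_alarms_and_dismissals(ranking, expert_relevant, 3)
```

In this toy ranking, two relevant images are displayed after three retrieved images, and an answer set of three images contains one false alarm and misses two relevant images.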

While the two steps aim at analyzing the retrieval effectiveness of each of the
distances, the second step helps identify the behavior of the newly proposed effectiveness
measures, namely Eff_ord and Eff_sys.

Figures 1 through 6 show the images ranked by the system (from top to bottom
and from left to right) when an image query (leftmost top image) is submitted. They
clearly show that image retrieval effectiveness is the highest when E is integrated into
the generalized Dirichlet mixture and the lowest when the Euclidean distance is used.
Indeed, the number of false alarms and false dismissals is the smallest for E + Dirichlet,
followed by E + Gibbs (exponentiation), followed by E, followed by DS*, and so on.
The worst similarity ranking is provided by the Euclidean distance.

Table 1 confirms our preceding observations about the performance of E + Dirichlet
against the effectiveness of the other similarity measures. The findings remain true in
Table 2, except for Faloutsos’s measure.



6 Conclusion 

In this paper, we have defined two distances: the dissimilitude distance DS* and the 
similarity distance E and proposed three variants of a new retrieval effectiveness mea- 
sure. When incorporated into the Gibbs random field and particularly to the generalized 
Dirichlet mixture, the distance E appears to be a good similarity measure. Empirical 
analysis of six similarity measures is conducted on color histograms of an image database 
and shows that retrieval effectiveness is the highest for E + Dirichlet and the lowest 
for the Euclidean distance. 

Our current activities concern the design of new algorithms for color layout extraction
(including the identification of spatial relationships) and image segmentation, in order to
gain more discriminating power in image retrieval and hence increase our system’s retrieval
effectiveness.



Multimedia Retrieval Using Multiple Examples

Abstract. This paper presents a variant of our generative probabilistic
multimedia retrieval model. Evaluation on the TRECVID 2003 collection
shows that the new variant, a document generation approach, is suitable
for information needs with multiple examples. Moreover, in combination
with textual information, the new variant outperforms the original one.



1 Introduction 

A commonly used paradigm in image and video retrieval is that of querying 
by example (QBE). An example document (image or video) is presented to the 
search engine, and similar documents are requested. A slightly modified form of 
this paradigm is adopted in the TRECVID video retrieval benchmarking effort
[1]. An information request is called a topic. It consists of a textual description of
the multimedia need accompanied by one or more image and/or video examples. 
The goal is to return a ranked list of shots that meet the information need. 

Combining multiple visual examples to return one set (or ranked list) of 
similar documents can be problematic. Consider for example the topic shown 
in Figure 1. Here the information need is for shots of points being scored in 
basketball. The need is clarified by 6 different examples, some of them close- 
ups of the ball going through the basket, others showing overview shots of the 
playing court. No document will be highly similar to all examples. Clearly, we
are looking for some sort of OR-functionality here; a query result should be
similar to any of the examples, but not necessarily to all.

A common approach to handling multiple queries is to run separate queries 
for each example and combine the results afterwards. In such an approach, the 
final score for a document is a function of either the scores or the ranks for the 
individual examples [2,3,4]. It is however far from trivial to choose a combination 
function that works well for a variety of queries. 
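
The choice of combination function can be made concrete with a small Python sketch. It is only an illustration, not the method of [2,3,4]: the function name, the shot identifiers and the scores are hypothetical, and missing scores are assumed to count as 0. It merges per-example score lists with a pluggable function, so that `max` gives OR-like behaviour and `min` gives AND-like behaviour.

```python
def combine_scores(per_example_scores, combine=max):
    """Merge one {doc_id: score} dict per query example into a single ranking.

    `combine` maps the list of a document's per-example scores to one score.
    Documents missing from a run are given a score of 0.0 (an assumption).
    """
    doc_ids = set()
    for scores in per_example_scores:
        doc_ids.update(scores)
    combined = {
        doc_id: combine([scores.get(doc_id, 0.0) for scores in per_example_scores])
        for doc_id in doc_ids
    }
    # Sort by decreasing combined score; ties are broken by document id.
    return sorted(combined, key=lambda doc_id: (-combined[doc_id], doc_id))


# Two query examples: shot s1 matches only the first, s2 only the second,
# and s3 is moderately similar to both.
runs = [
    {"s1": 0.9, "s2": 0.1, "s3": 0.4},
    {"s1": 0.1, "s2": 0.8, "s3": 0.4},
]
or_like = combine_scores(runs, combine=max)
and_like = combine_scores(runs, combine=min)
```

With `max`, the shots that match a single example well are ranked first; with `min`, the shot that is moderately similar to both examples wins. Neither behaviour is right for every topic, which is the difficulty noted above.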




Fig. 1. Topic 101: ‘Find shots of a basket being made’. 

The present work departs from this approach and captures all the different facets of
a set of query examples in a single topic model. For retrieval, all documents in a
collection are compared to this single topic model and ranked accordingly. The
rest of the paper is organised as follows. Section 2 describes a generative probabilistic
approach to information retrieval. Section 3 discusses how this approach
can be applied to image and video retrieval. Section 4 shows experimental results
and Section 5 summarises our main conclusions.



2 Generative Probabilistic Retrieval 



Following Sparck Jones et al. [5], and Lafferty and Zhai [6], we introduce random
variables D and Q to represent a document and a query, and an event r to
represent ‘relevant’, and try to answer the following “Basic Question”: What is
the probability that this document is relevant to this query? This probability of
relevance, P(r|D,Q), can be estimated indirectly using Bayes’ rule: P(r|D,Q) =
P(D,Q|r)P(r)/P(D,Q). For ranking documents, we may avoid estimation of
P(D,Q) by using the odds of relevance:

$$\frac{P(r|D,Q)}{P(\bar{r}|D,Q)} = \frac{P(D,Q|r)\,P(r)}{P(D,Q|\bar{r})\,P(\bar{r})}, \tag{1}$$

where r̄ means not r. In the following, Q and D are assumed independent in the
non-relevant case (r̄).

Assumption 1. P(Q,D|r̄) = P(Q|r̄) P(D|r̄)

Factoring the conditional probability P(D,Q|r) in different ways leads to two
distinct, though probabilistically equivalent, models [6]. One model corresponds
to query generation, and the other to document generation.
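
The equivalence of the two factorisations can be checked numerically. The following Python sketch uses a made-up toy distribution over two documents and two queries (all numbers and names are assumptions for illustration). It computes the odds of relevance directly, via the query generation factoring P(D,Q|r) = P(Q|D,r)P(D|r), and via the document generation factoring P(D,Q|r) = P(D|Q,r)P(Q|r), using Assumption 1 for the non-relevant case.

```python
import math

# Toy distributions (hypothetical values chosen only for illustration).
p_r = 0.3                                    # P(r)
p_dq_rel = {("d0", "q0"): 0.4, ("d0", "q1"): 0.1,
            ("d1", "q0"): 0.2, ("d1", "q1"): 0.3}   # P(D,Q|r)
p_d_non = {"d0": 0.5, "d1": 0.5}             # P(D|not r)
p_q_non = {"q0": 0.6, "q1": 0.4}             # P(Q|not r)


def p_d_rel(d):
    """Marginal P(D|r)."""
    return sum(p for (dd, _), p in p_dq_rel.items() if dd == d)


def p_q_rel(q):
    """Marginal P(Q|r)."""
    return sum(p for (_, qq), p in p_dq_rel.items() if qq == q)


def odds_direct(d, q):
    # Assumption 1: P(D,Q|not r) = P(D|not r) P(Q|not r)
    return p_dq_rel[(d, q)] * p_r / (p_d_non[d] * p_q_non[q] * (1.0 - p_r))


def odds_query_generation(d, q):
    p_q_given_d = p_dq_rel[(d, q)] / p_d_rel(d)          # P(Q|D,r)
    prior_odds = p_d_rel(d) / p_d_non[d]
    independent_of_d = p_r / (p_q_non[q] * (1.0 - p_r))
    return p_q_given_d * prior_odds * independent_of_d


def odds_document_generation(d, q):
    p_d_given_q = p_dq_rel[(d, q)] / p_q_rel(q)          # P(D|Q,r)
    independent_of_d = p_q_rel(q) * p_r / (p_q_non[q] * (1.0 - p_r))
    return (p_d_given_q / p_d_non[d]) * independent_of_d


pairs = [(d, q) for d in ("d0", "d1") for q in ("q0", "q1")]
```

The three functions return the same odds for every document and query pair; the two models only start to differ once document-independent factors are dropped and the remaining terms are estimated in different ways.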

The query generation model results from factoring P(D,Q|r) as P(D,Q|r) =
P(Q|D,r)P(D|r), giving the following odds of relevance:

$$\frac{P(r|D,Q)}{P(\bar{r}|D,Q)} = \frac{P(D,Q|r)\,P(r)}{P(D,Q|\bar{r})\,P(\bar{r})} = P(Q|D,r)\,\underbrace{\frac{P(D|r)}{P(D|\bar{r})}}_{\text{prior odds}}\;\underbrace{\frac{P(r)}{P(Q|\bar{r})\,P(\bar{r})}}_{\text{independent of } D} \tag{2}$$

Since the goal is to rank documents, we can ignore the document-independent
terms. Also, we assume equal priors, i.e., a priori all documents are equally likely.
This results in the following retrieval status value (RSV) for a document D:

$$\mathrm{RSV}(D) = P(Q|D,r) \tag{3}$$



The document generation approach results from factoring P(D,Q|r) as
P(D,Q|r) = P(D|Q,r)P(Q|r), arriving at a different equation for the odds of
relevance:

$$\frac{P(r|D,Q)}{P(\bar{r}|D,Q)} = \frac{P(D,Q|r)\,P(r)}{P(D,Q|\bar{r})\,P(\bar{r})} = \frac{P(D|Q,r)}{P(D|\bar{r})}\;\underbrace{\frac{P(Q|r)\,P(r)}{P(Q|\bar{r})\,P(\bar{r})}}_{\text{independent of } D} \tag{4}$$

Ignoring all factors independent of D for ranking gives the following RSV:

$$\mathrm{RSV}(D) = \frac{P(D|Q,r)}{P(D|\bar{r})} \tag{5}$$



3 Generative Multimedia Retrieval

The next step is to define how to estimate the probabilities P(Q|D,r), P(D|Q,r)
and P(D|r̄). Documents in our case are video shots and queries are either (sets
of) images or shots. We choose to represent a shot by a representative keyframe;
thus all queries and documents are images. A variant in which temporal aspects
are incorporated is presented in [3]. We estimate the (conditional) probabilities
of queries and documents by building a statistical model for each image. Other
generative approaches for multimedia retrieval include [7,8].

3.1 Gaussian Mixture Models 

The model assumes that an image is the outcome of a random process that
generates n-dimensional feature vectors x = (x_1, ..., x_n), where each feature
vector describes a small, square block of pixels. The retrieval framework itself is
independent of the specifics of the features; we have used DCT coefficients
and x- and y-coordinates to capture colour, texture and position of a pixel block.
In the remainder, the term sample is used to refer to both the feature vectors
and the pixel blocks they describe. One or more images are represented as a bag
of samples X = {x_1, x_2, ..., x_{N_S}}.

The samples are assumed to be generated by a mixture of Gaussian sources,
where the number of Gaussian components N_C is fixed for all images in the
collection. The Gaussian mixture model (GMM) is fully described by a set of
parameters θ = (θ_1, ..., θ_{N_C}) defining the different components. Each component
C_i is described by its prior probability P(C_i), the mean μ_i and the variance
Σ_i, thus θ_i = (P(C_i), μ_i, Σ_i). Details about estimating these parameters are deferred
to Section 3.2. Equation 6 defines the probability of drawing one sample
x from a GMM with parameters θ.

$$p(x|\theta) = \sum_{i=1}^{N_C} P(C_i)\,\frac{1}{\sqrt{(2\pi)^n\,|\Sigma_i|}}\;e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)} \tag{6}$$



The probability of drawing a bag of samples is simply the joint probability of
drawing the individual samples:

$$p(X|\theta) = \prod_{j=1}^{N_S} p(x_j|\theta) \tag{7}$$
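
Equations 6 and 7 can be sketched in a few lines of Python. This is an illustration only: it assumes diagonal covariance matrices (so that each component stores a list of per-dimension variances), and the function names are hypothetical. The bag likelihood is computed in log space, since a product of many small densities underflows quickly.

```python
import math


def gaussian_density(x, mean, variances):
    """Density at x of a Gaussian with diagonal covariance (an assumption)."""
    n = len(x)
    log_det = sum(math.log(v) for v in variances)
    mahalanobis = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, variances))
    return math.exp(-0.5 * (n * math.log(2.0 * math.pi) + log_det + mahalanobis))


def gmm_density(x, components):
    """Eq. 6: weighted sum over (prior, mean, variances) components."""
    return sum(prior * gaussian_density(x, mean, variances)
               for prior, mean, variances in components)


def bag_log_likelihood(samples, components):
    """Logarithm of Eq. 7: sum of the per-sample log densities."""
    return sum(math.log(gmm_density(x, components)) for x in samples)


# A two-component model over 2-D samples (hypothetical parameters).
model = [
    (0.7, (0.0, 0.0), (1.0, 1.0)),
    (0.3, (4.0, 4.0), (0.5, 0.5)),
]
bag = [(0.1, -0.2), (3.9, 4.2), (0.5, 0.3)]
score = bag_log_likelihood(bag, model)
```

Ranking by the log likelihood gives the same order as ranking by the likelihood itself, because the logarithm is monotonic.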



3.2 Parameter Estimation 

One way to look at mixture modelling for images is by assuming an image can
show only so many different things, each of which is modelled by a Gaussian
distribution. Each sample in a document is then assumed to be generated from
one of these Gaussian components. This viewpoint, where ultimately each sample
is explained by one and only one component, is useful when estimating the
GMM parameters. The assignments of samples x_j to components C_i can be
viewed as hidden variables, so the Expectation Maximisation (EM) algorithm
[9] can be used. This algorithm iterates between estimating the a posteriori class
probabilities for each sample (the E-step) given the current model settings, and
re-estimating the component parameters based on the sample distribution and
the current sample assignments (the M-step):

E-step: Estimate the hidden assignments h_ij of samples x_j to components C_i,
for all samples and components.

$$h_{ij} = P(C_i|x_j) = \frac{p(x_j|C_i)\,P(C_i)}{\sum_{c=1}^{N_C} p(x_j|C_c)\,P(C_c)} \tag{8}$$

M-step: Update the components’ parameters to maximise the joint probability
of component assignments and samples, θ^new = argmax_θ p(X, H|θ), where H
is the matrix with all sample assignments h_ij. More specifically:

$$\mu_i^{new} = \frac{\sum_j h_{ij}\,x_j}{\sum_j h_{ij}} \tag{9}$$

$$\Sigma_i^{new} = \frac{\sum_j h_{ij}\,(x_j - \mu_i^{new})(x_j - \mu_i^{new})^T}{\sum_j h_{ij}} \tag{10}$$

$$P(C_i)^{new} = \frac{1}{N_S}\sum_j h_{ij} \tag{11}$$



The algorithm is guaranteed to converge to a local optimum. In previous exper- 
iments we found EM initialisation hardly influences the retrieval results [10]. 
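
The iteration can be sketched for the one-dimensional case as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: samples are scalars, the variance floor `min_var` is an added safeguard, and the data and starting values are made up.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(samples, priors, means, variances):
    """Log likelihood of the samples under the mixture."""
    return sum(
        math.log(sum(p * normal_pdf(x, m, v) for p, m, v in zip(priors, means, variances)))
        for x in samples
    )


def em_step(samples, priors, means, variances, min_var=1e-6):
    """One EM iteration for a 1-D Gaussian mixture (cf. Eqs. 8 to 11)."""
    n_comp = len(priors)
    # E-step: h[i][j] = P(C_i | x_j)
    h = [[0.0] * len(samples) for _ in range(n_comp)]
    for j, x in enumerate(samples):
        weighted = [priors[i] * normal_pdf(x, means[i], variances[i]) for i in range(n_comp)]
        total = sum(weighted)
        for i in range(n_comp):
            h[i][j] = weighted[i] / total
    # M-step: re-estimate mean, variance and prior of every component.
    new_priors, new_means, new_variances = [], [], []
    for i in range(n_comp):
        mass = sum(h[i])
        mean = sum(h_ij * x for h_ij, x in zip(h[i], samples)) / mass
        var = sum(h_ij * (x - mean) ** 2 for h_ij, x in zip(h[i], samples)) / mass
        new_priors.append(mass / len(samples))
        new_means.append(mean)
        new_variances.append(max(var, min_var))  # floor is an added safeguard
    return new_priors, new_means, new_variances


samples = [0.0, 0.2, -0.1, 5.0, 5.1, 4.9]
priors, means, variances = [0.5, 0.5], [-1.0, 6.0], [1.0, 1.0]
history = [log_likelihood(samples, priors, means, variances)]
for _ in range(10):
    priors, means, variances = em_step(samples, priors, means, variances)
    history.append(log_likelihood(samples, priors, means, variances))
```

On this toy data the two means settle near the centres of the two groups of samples, and the log likelihood stored in `history` does not decrease from one iteration to the next, which is the convergence behaviour referred to above.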



3.3 Smoothing 

Typicalities are more interesting than commonalities. Smoothing is a technique
for explaining the common query terms, to reduce their influence on the ranking
[11]. The estimates of the GMM are smoothed using interpolation with a general
background distribution; this technique is known as Jelinek-Mercer smoothing
[12]. The smoothed version of the likelihood for a single sample x becomes (cf.
Equation 6):

$$p_{smooth}(x|\theta) = \kappa \left[ \sum_{i=1}^{N_C} P(C_i)\,\frac{1}{\sqrt{(2\pi)^n\,|\Sigma_i|}}\;e^{-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)} \right] + (1 - \kappa)\,p(x), \tag{12}$$

where κ is a mixture parameter that can be estimated on training data
with known relevant documents. The background density p(x) is estimated by
marginalisation over all document models in a reference collection 𝒟:

$$p(x) = \sum_{d \in \mathcal{D}} p(x|\theta_d)\,P(d) \tag{13}$$

The reference collection 𝒟 can be the current collection, a representative sample
of it, or another comparable collection.

3.4 GMMs and the Retrieval Framework

In the GMM approach, each document D has two representations: a set of samples
X_D and a Gaussian mixture model θ_D (the same holds for queries Q). To
relate this to the conditional probabilities introduced in Section 2, we estimate
P(A|B,r) as the probability that the model of B (θ_B) generates the samples of
A (X_A). Furthermore, to estimate P(A|r̄) we use the joint background density of
all samples of X_A (cf. Equation 13). Thus, the retrieval status values for query
generation (Eq. 3) and document generation (Eq. 5) are estimated as

$$\mathrm{RSV}_{Qgen}(D) = P(Q|D,r) = p(X_Q|\theta_D) \tag{14}$$

$$\mathrm{RSV}_{Dgen}(D) = \frac{P(D|Q,r)}{P(D|\bar{r})} = \frac{p(X_D|\theta_Q)}{p(X_D)} \tag{15}$$
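
A minimal Python sketch of how the smoothed likelihood of Section 3.3 and the two retrieval status values fit together is given below. It works in log space and treats the model and background densities as plain callables. The function names are hypothetical, and applying the same interpolation weight in both variants is an assumption of this sketch rather than something stated above.

```python
import math


def smoothed_log_likelihood(samples, model_density, background_density, kappa):
    """Sum over samples of log[kappa * p(x|theta) + (1 - kappa) * p(x)] (cf. Eq. 12)."""
    return sum(
        math.log(kappa * model_density(x) + (1.0 - kappa) * background_density(x))
        for x in samples
    )


def rsv_query_generation(query_samples, doc_density, background_density, kappa):
    """Log of Eq. 14: how well the document model explains the query samples."""
    return smoothed_log_likelihood(query_samples, doc_density, background_density, kappa)


def rsv_document_generation(doc_samples, topic_density, background_density, kappa):
    """Log of Eq. 15: log p(X_D | theta_Q) minus log p(X_D)."""
    log_topic = smoothed_log_likelihood(doc_samples, topic_density, background_density, kappa)
    log_background = sum(math.log(background_density(x)) for x in doc_samples)
    return log_topic - log_background


# Constant densities keep the arithmetic easy to follow (hypothetical values).
def topic(x):
    return 0.5


def background(x):
    return 0.25


two_samples = [0.0, 1.0]
qgen = rsv_query_generation(two_samples, topic, background, kappa=0.5)
dgen = rsv_document_generation(two_samples, topic, background, kappa=0.5)
```

In the document generation score, samples that the background explains as well as the topic model contribute nothing, so only distinctive samples move a document up or down the ranking.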



4 Experiments 

We evaluated the query generation and document generation variants of the generative
probabilistic retrieval framework on the TRECVID 2003 search task [1]. For each
document in the collection, and for each set of query examples, we build an
8-component GMM as described in Section 3.2 (N_C = 8). Since we are interested
in multiple-example queries, we regard samples from all available query
images as a single set of query samples. We study two variants to represent the
sets of query samples X_Q. The first variant uses all available query samples, the
second only those samples occurring in manually selected, interesting regions. 1
The same sets of samples are used to build topic models θ_Q for the document
generation approach.



4.1 Results 

We have two model variants (query generation and document generation), and 
two ways of building query sample sets (full and regions). This amounts to four
different system variants. Each of these is evaluated in isolation, as well as in 
combination with textual information. In the multimodal runs, we use a sepa- 
rate textual model, similar to the query generation approach described before. 2 
For each shot a textual model is built from speech transcripts associated with 
the shot. 3 Assuming independence between the modalities, visual and textual 
models are used separately, and scores are combined afterwards. For details see 
[14]. Table 1 shows results for different experimental settings (the last column is 
explained in Section 4.2). Using full example images, query generation outper- 
forms document generation, but if we select regions, the situation is reversed. 

1 Our manually selected query regions are available from
http://www.cwi.nl/projects/trecvid/trecvid2003/.

2 A document generation approach for the textual part is problematic, since the short
text queries provide insufficient data to estimate proper topic models from.

3 The speech transcripts have been kindly provided by LIMSI [13].

Looking at the average precision scores per topic, rather than only at the mean,
and inspecting the returned ranked lists for the different models, interesting dif- 
ferences are found. The query generation approach seems to be good at finding 
(near) exact matches, and is successful mainly when the set of examples is homo- 
geneous (e.g. highly similar CNN baseball shots, or Dow Jones graphics). When 
a set of examples is less homogeneous, often a single example dominates the 
query generation results. Figure 2 shows this effect. In the document generation 
approach, the topic models seem to have learned important common aspects of 
the query examples, thus all examples contribute to the combined result (see 
Figure 2), and more generic matches are found. The fact that common aspects 
are learned, could be an explanation why selecting regions helps here. When 
a user indicates important regions, the topic models will be more focused and 
retrieve better documents. In the query generation approach, selecting regions 
does not help, since exact matching relies heavily on background similarity. 



Table 1. Mean average precision for visual information only, and for a combination
with text (MM) (text only: .130). Signs between brackets indicate a significant
in/decrease of Dgen in comparison to the corresponding Qgen variant; Wilcoxon signed
rank test, at confidence levels of 95% (+/-) and 99% (++/--).

[Figure 2 panels: Query Generation, Document Generation]

Fig. 2. Visualisations of the top 50s for the rocket launch query (topic 107). Each
row represents retrieved documents for one run. Documents that are within the
top 50 for the multiple-example run (top rows) are assigned a colour code; documents
within the top 100 for this run are represented as grey rectangles. If a
document from the multiple-example run appears in another result, it is represented
the same way. Documents not in the top 100 for the multiple-example run
are not represented anywhere. Plots created using NIST’s BeadPlot tool (see
http://www.itl.nist.gov/iaui/894.02/projects/beadplot/).

In combination with textual information, the region-based document generation
approach is better than any query generation variant. The lower performance
of the query generation approaches can be explained by the near-exact
matches on visual content interfering with the textual ranking. In the document
generation approach, however, the visual information seems to provide the
generic visual context, while the textual information zooms in on specific results.
For example, for topics that ask for airplanes, helicopters or rocket launches, the
visual model captures the fact that we are looking for an object against a background
of sky. The textual information can then help to distinguish between
specific objects. Figure 3 shows an example.




Fig. 3. Document generation results (top 5) for the rocket launch query (topic 107). The
visual information sets the context (top row, sky background); adding textual information
fills in specifics (bottom row, rockets).



4.2 Automatically Selecting Regions 



It is clear that selecting regions is useful for the document generation approach. 
Rather than selecting these manually, it is possible to automatically select im- 
portant parts of an example image. The main idea is to select those parts of the 
example that differ most from the average image. Samples that are likely to be 
generated by any model should not influence the training process too much. A 
similar approach for text retrieval is studied in [15].

This can be achieved by incorporating background probabilities (Equation
13) in the training process. Again, hidden variables h_ij indicate the assignment
of samples x_j to components C_i, but now samples can also be assigned to the
background, indicated by h_BGj. The EM algorithm can be applied as before.
The E-step changes to:

$$h_{ij} = P(C_i|x_j) = \frac{p(x_j|C_i)\,P(C_i)}{\sum_{c=1}^{N_C} p(x_j|C_c)\,P(C_c) + p(x_j)\,P(BG)} \tag{16}$$

$$h_{BGj} = P(BG|x_j) = \frac{p(x_j)\,P(BG)}{\sum_{c=1}^{N_C} p(x_j|C_c)\,P(C_c) + p(x_j)\,P(BG)}, \tag{17}$$

where P(BG|x_j) is the posterior probability that x_j is from the background,
and P(BG) is the prior probability that we see background samples from the
current model.

The M-step does not update the background model p(x). All we update are
the component parameters (as in Equations 9, 10 and 11) and the background
prior P(BG) for the current model:

$$P(BG)^{new} = \frac{1}{N_S}\sum_j h_{BGj} \tag{18}$$

Since common samples will be assigned to the background, only distinguishing 
samples are used in estimating the components’ parameters. Figure 4 shows an 
example image and the regions that are automatically selected to build the model 
from. 
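
The modified E-step and the update of the background prior can be sketched as follows for one-dimensional samples. This is an illustration under assumptions, not the implementation behind the reported results: the function names are hypothetical, the background density is passed in as a callable, and the constant background used in the example is made up.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step_with_background(samples, priors, means, variances, background_density, prior_bg):
    """Responsibilities of the components and of the background for every sample."""
    h = [[0.0] * len(samples) for _ in priors]
    h_bg = [0.0] * len(samples)
    for j, x in enumerate(samples):
        weighted = [p * normal_pdf(x, m, v) for p, m, v in zip(priors, means, variances)]
        bg = prior_bg * background_density(x)
        total = sum(weighted) + bg
        for i, w in enumerate(weighted):
            h[i][j] = w / total
        h_bg[j] = bg / total
    return h, h_bg


def update_background_prior(h_bg):
    """New P(BG): the average background responsibility over all samples."""
    return sum(h_bg) / len(h_bg)


def flat_background(x):
    # Hypothetical background: uniform density over an interval of width 20.
    return 0.05


data = [0.0, 10.0]
h, h_bg = e_step_with_background(data, [0.5], [0.0], [1.0], flat_background, 0.5)
new_prior_bg = update_background_prior(h_bg)
```

The sample at 10.0 is far from the only component, so almost all of its responsibility goes to the background and it hardly influences the component update; the sample at 0.0 is mostly claimed by the component.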




Fig. 4. Sphinx example. Original image (left) and samples selected by EM algorithm 
(right). 



The rightmost column of Table 1 shows the results for the new EM variant.
A small (1%) sample from a comparable collection (the TRECVID 2003
development set) was used to estimate the background probabilities (p(x_j)) for
the query samples. Clearly, using background probabilities during training helps.
Automatically selecting regions using the new EM variant is almost as good as 
manually selecting important regions. Automatically finding distinguishing parts 
within manually selected regions gives another improvement. 

5 Conclusions 

This work presented two ways of applying generative probabilistic retrieval mod- 
els to the problem of video retrieval: a query generation approach and a doc- 
ument generation approach. We showed that the query generation approach is 
not good at handling multiple-example queries. Usually, there is no document 
model in the collection that is likely to generate all available visual examples. In 
such cases, the query generation approach results in a model that explains only
one of the examples very well.

The document generation approach, on the other hand, has to capture all
information available in the examples in a limited number of Gaussian components.
Therefore, it captures mainly things that are present in all examples, and
thus builds a model that describes the commonalities shared by the examples.
This leads to results that take all different examples into account. Often the
things captured in the query models are of a generic, context-like nature (e.g.,
sky, grass, water). This turns out to be very useful in combination with textual
information, where the results are far better than anything obtained so far using
the query generation approach. We also showed, specifically for the document
generation approach, that indicating important regions in example images is useful
for retrieval. Our automatic approach yields results comparable to manual
region selection. Automatically selecting important parts within manually created
regions gives another (though slight) improvement in the scores over using
the user’s manual selection as is.

Future work on the document generation model should establish whether the
results become more like exact matches when multiple-example queries are
modelled by more components. Another plan is to investigate automatic selection
of samples for the query generation approach.



Abstract. During an interactive image retrieval process with relevance feedback,
kernel-based or boosted learning algorithms can provide superior nonlinear modeling
capability. In this paper, we discuss such nonlinear extensions for biased
discriminants, or BiasMap [1,2]. Kernel partial alignment is proposed as the criterion
for kernel selection. The associated analysis also provides a gauge on relative
class scatters, which can guide an asymmetric learner, such as BiasMap, toward
better class modeling. We also propose two boosted versions of BiasMap. Unlike
existing approaches that boost feature components or vectors to form a composite
classifier, our scheme boosts linear BiasMap toward a nonlinear ranker, which is
more suited for small-sample learning during interactive image retrieval. Experiments
on heterogeneous image database retrieval as well as small-sample face
retrieval are used for performance evaluation.



1 Introduction 

Within the last decade, numerous relevance feedback algorithms have been proposed
for learning during interactive image retrieval [3,4,5,6,7,8]. For recent surveys see [9,2].
Most early relevance feedback algorithms assumed a Gaussian distribution for image
classes. To relax this restrictive assumption, recent developments use either kernel machines
or boosting methods to capture nonlinearities. Examples include kernel-based
support vector machines [10,11,12,13], the adaptive quasiconformal kernel (AQK) method
[14], kernel-based BiasMap [1,2], boosting approaches including FBoost or VBoost [4,15],
and constrained similarity measures using SVM or AdaBoost [16].

However, none of the above addresses kernel selection in a principled way. In this paper,
we propose kernel partial alignment for measuring kernel “fitness”, which is especially
suited for small-sample learning during image retrieval. It also provides a gauge on
relative class scatters, which can guide a biased learner, such as BiasMap, toward better
class modeling.

We also propose two boosting algorithms as alternatives to the kernel-based version
for nonlinear distributions. Unlike existing approaches that boost individual feature
components or vectors to form a composite and symmetric classifier [4,15], our scheme
boosts multiple linear rankers (each of which uses all the available features) toward a
nonlinear ranker, which is more suited for information retrieval tasks [17].

It should be noted, however, that in content-based image retrieval (CBIR), the success
of the learning module is contingent upon the effectiveness of the content representation
module. To focus on the issue of learning, we assume that a subset of the visual features,
with some transformations, is relevant and informative, although we do expect the
features to be superfluous for any given retrieval task.

We will first briefly revisit the biased discriminant analysis (BDA) algorithm, upon 
which we will test our proposed kernel and boosting techniques. We use the term BiasMap 
as a shorthand for the class of algorithms based on linear or kernel BDA. 

2 BiasMap Revisited 

Given a set of positive and negative examples fed back by the user, the aim is to find a
ranking function such that not only are the positive examples ranked higher
than the negative ones, but the positive examples are also among the top k returns, with
a minimal k. In [1], the assumption was that the user is only interested in one class out of
an unknown number of many. Thus the negative examples can come from an uncertain
number (p) of classes. If we could get sufficient training examples, there would be no need to
distinguish so-called (1 + p)-class learning from traditional two-class learning. However,
when the training sample is small, such a treatment becomes useful.

2.1 Biased Discriminant Analysis (BDA) 

An optimal discriminative transform matrix W_opt was formulated as follows:

$$W_{opt} = \arg\max_{W} \frac{|W^T S_y W|}{|W^T S_x W|} \tag{1}$$

where S_x and S_y are the scatter matrix estimates:

$$S_x = \sum_{i=1}^{N_x} (x_i - m_x)(x_i - m_x)^T, \qquad S_y = \sum_{i=1}^{N_y} (y_i - m_x)(y_i - m_x)^T \tag{2}$$

{x_i, i = 1, ..., N_x} denote the positive examples, and {y_i, i = 1, ..., N_y} denote
the negative examples. Each element of these sets is a vector of length n, which is the
dimension of the feature space. m_x is the mean vector of the set {x_i}. The term “biased
discriminant analysis” should not be confused with that of [18], where only the statistical
bias of the original discriminant analysis was studied; while [1] also addresses statistical
bias, it further explicitly models class asymmetry stemming from the subjective bias of
the user.
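
The asymmetry of the criterion can be illustrated with a small Python sketch. It is not the BiasMap implementation: it restricts W to a single direction w, in which case the determinant ratio of Eq. 1 reduces to the scalar w^T S_y w / w^T S_x w, and it evaluates candidate directions instead of solving the underlying generalized eigenproblem. Function names and data are hypothetical.

```python
def scatter_about(points, center):
    """Scatter matrix of `points` about `center` (cf. Eq. 2)."""
    n = len(center)
    scatter = [[0.0] * n for _ in range(n)]
    for point in points:
        diff = [p - c for p, c in zip(point, center)]
        for a in range(n):
            for b in range(n):
                scatter[a][b] += diff[a] * diff[b]
    return scatter


def quadratic_form(w, matrix):
    """Compute w^T M w."""
    size = len(w)
    return sum(w[a] * matrix[a][b] * w[b] for a in range(size) for b in range(size))


def bda_ratio(w, positives, negatives):
    """Negative scatter over positive scatter along direction w.

    Both scatters are taken about the positive mean m_x, which is what makes
    the criterion biased toward the positive class.
    """
    n = len(positives[0])
    m_x = [sum(p[k] for p in positives) / len(positives) for k in range(n)]
    s_x = scatter_about(positives, m_x)
    s_y = scatter_about(negatives, m_x)
    return quadratic_form(w, s_y) / quadratic_form(w, s_x)


# Positives are tight along the first axis; negatives lie far away along it.
positives = [(0.0, 0.0), (0.1, 1.0), (-0.1, -1.0)]
negatives = [(3.0, 0.0), (-3.0, 0.5)]
ratio_axis_1 = bda_ratio((1.0, 0.0), positives, negatives)
ratio_axis_2 = bda_ratio((0.0, 1.0), positives, negatives)
```

Projecting onto the first axis keeps the positives clustered while pushing the negatives away from the positive mean, so its ratio is much larger than that of the second axis. The negatives need not form a single cluster, in line with the (1 + p)-class view.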

2.2 Kernel BDA (KBDA) 

The rationale of kernel BDA, or of kernel machines in general, is to apply the original
linear algorithm in a feature space, F, which is related to the original space by a non-linear
mapping

$$\phi : C \to F, \qquad x \mapsto \phi(x) \tag{3}$$

where C is a compact subset of ℝ^n, such that linearly non-separable configurations
become separable in F. However, this mapping can be formidably expensive and is thus
not carried out explicitly, but through the evaluation of a kernel matrix K with
components k(x_i, x_j) = φ^T(x_i) φ(x_j). This is the same idea adopted by nonlinear
support vector machines [19], kernel PCA, and kernel discriminant analysis [20,21].

In [1], it was shown that the optimal nonlinear feature space mapping is achieved 
through a weighted summation of kernel distances to the training points. The weights 
are the solutions to a generalized eigenanalysis problem. The retrieval process is then 
a simple search of nearest neighbors in this transformed space. However, the issue of 
kernel or kernel parameter selection was not studied in [1]. This is our next topic.

3 Kernel Partial Alignment 

In this section, we introduce a measure called kernel partial alignment for kernel or 
kernel parameter selection. The idea is an adaptation of the kernel alignment of [22] for 
BiasMap. The aim is to measure the “alignment” between a Gram matrix and an ideal 
target matrix, and use the score as a goodness measure of the kernel. 

Definition 1 (Kernel Alignment [22]). The alignment of a kernel k on a sample S with
the target matrix ll^T is:

$$A = \frac{\langle K, ll^T \rangle_F}{\|K\|_F\,\|ll^T\|_F} = \frac{\langle K, ll^T \rangle_F}{N\,\|K\|_F}$$

where l is the class label vector taking values from {−1, 1}, K is the Gram matrix for S
using k, and the Frobenius matrix inner product (||K||_F is the associated matrix norm)
is adopted: ⟨P, Q⟩_F = Σ_{i,j} P_ij Q_ij.

The kernel alignment A provides a measure for selecting/combining kernels.
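
As an illustration of how the alignment can guide parameter selection, the following Python sketch computes A for a Gaussian (RBF) kernel at two widths. The RBF kernel, the parameter name `gamma`, the function names and the toy data are all assumptions made for this example; the definition itself does not prescribe a kernel family.

```python
import math


def rbf_gram(points, gamma):
    """Gram matrix of the kernel k(p, q) = exp(-gamma * ||p - q||^2)."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(p, q))) for q in points]
        for p in points
    ]


def frobenius_inner(p, q):
    """Frobenius inner product: sum over i, j of P_ij * Q_ij."""
    return sum(p_ij * q_ij for row_p, row_q in zip(p, q) for p_ij, q_ij in zip(row_p, row_q))


def kernel_alignment(gram, labels):
    """Alignment between a Gram matrix and the target l l^T, labels in {-1, +1}."""
    target = [[a * b for b in labels] for a in labels]
    n = len(labels)  # the Frobenius norm of l l^T equals n
    return frobenius_inner(gram, target) / (n * math.sqrt(frobenius_inner(gram, gram)))


points = [(0.0,), (1.0,), (5.0,), (6.0,)]
labels = [1, 1, -1, -1]
wide = kernel_alignment(rbf_gram(points, 0.5), labels)
narrow = kernel_alignment(rbf_gram(points, 100.0), labels)
```

A very narrow kernel makes the Gram matrix nearly the identity, so every point is similar only to itself and the alignment drops to about 1/sqrt(N); the wider kernel groups the points by class and scores higher.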

Using similar notations we define kernel partial alignment as follows: 

Definition 2 (Kernel Partial Alignment). The partial alignment of kernel k with the
ideal target matrix on a sample S = {x_1, ..., x_{N_x}, y_1, ..., y_{N_y}}, with the x_i being the
positive set, is:

$$A^p_{x|y} = \frac{\langle [K_{xx}, K_{xy}],\, [L_{xx}, L_{xy}] \rangle_F}{\|[K_{xx}, K_{xy}]\|_F\,\|[L_{xx}, L_{xy}]\|_F}$$

where K_xx is the N_x by N_x kernel matrix with elements k(x_i, x_j), K_xy is the N_x by
N_y kernel matrix with elements k(x_i, y_j); L_xx and L_xy are of the same size as K_xx and
K_xy, and with elements 1 and −1, respectively. (In case the data is unbalanced, the −1
shall be replaced by −N_x/N_y.)
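
Definition 2 translates into a few lines of Python. The sketch below takes the two kernel blocks as nested lists; the function name and the `balance` flag (which switches the target entries of L_xy from −1 to −N_x/N_y) are hypothetical names introduced for illustration.

```python
import math


def partial_alignment(k_xx, k_xy, balance=False):
    """Partial alignment of the blocks [K_xx, K_xy] with the target [L_xx, L_xy].

    k_xx: N_x by N_x kernel values among the positive examples.
    k_xy: N_x by N_y kernel values between positive and negative examples.
    """
    n_x = len(k_xx)
    n_y = len(k_xy[0])
    negative_target = -n_x / n_y if balance else -1.0
    # Inner product with the target: entries of L_xx are 1, entries of L_xy are negative.
    inner = sum(sum(row) for row in k_xx) + negative_target * sum(sum(row) for row in k_xy)
    k_norm = math.sqrt(
        sum(v * v for row in k_xx for v in row) + sum(v * v for row in k_xy for v in row)
    )
    l_norm = math.sqrt(n_x * n_x + n_x * n_y * negative_target ** 2)
    return inner / (k_norm * l_norm)


# Positives fully similar to each other and dissimilar (zero kernel value)
# to the three negatives.
k_xx = [[1.0, 1.0], [1.0, 1.0]]
k_xy = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
score = partial_alignment(k_xx, k_xy)
```

Only similarities that involve the positive examples enter the score; the similarities among the negative examples are never consulted.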

Notice that A^p by definition is asymmetric with respect to x and y. A^p is in fact an
alignment based on part of the overall Gram matrix K and part of the target matrix ll^T,
hence the name. While the original definition of kernel alignment accumulates within-class
similarities for both classes, the partial definition accumulates only the similarities




